Bradley-Terry Ranking: The Chess Method for Brands

You will learn what Bradley-Terry ranking is, why pairwise comparison beats absolute scoring when you are measuring something noisy, and how the same method that ranks chess players and AI models can rank brands inside AI recommendations.

This is the first lesson of Module 5. In Module 4 you learned why most AEO tools produce unreliable data: named prompts trigger sycophancy, absolute rankings wobble, and the numbers on your dashboard are partly an artifact of method. Now we move from "what is broken" to "what works." Module 5 teaches three techniques that, combined, produce valid measurement. Bradley-Terry is the first one.

The problem with absolute rankings

Ask ChatGPT, "Rank the top ten fitness wearables." Run it again. Run it a third time. You will get three different lists. Rand Fishkin and the team at SparkToro ran exactly this test at scale, 600 volunteers, 2,961 prompts, and found that asking AI the same ranking question twice produces the same answer less than one percent of the time (Fishkin 2026). That is not a rounding error. That is the measurement failing.

The reason is subtle. Large language models do not have a calibrated internal scale for "how good is Brand A on a 0 to 100 scale." They have a probability distribution over likely next words. When you ask for an absolute ranking, you are asking the model to invent a scale it does not have, and the result drifts every time.

AEO claim-evidence SparkToro's 2026 study of AI brand recommendation consistency found that asking AI the same ranking question twice produces the same answer less than 1% of the time across 2,961 prompts from 600 users. Absolute ranking is not a stable measurement primitive for AI systems. See ai recommendation consistency.

The chess insight

Here is the key idea. Instead of asking, "How good is Magnus Carlsen on an absolute scale?", which is impossible to answer, chess asks, "When Carlsen plays another grandmaster, how often does he win?" From enough head-to-head games, a global ranking emerges.

This is the idea behind the Elo rating system, and the underlying model comes from a 1952 paper by Ralph Bradley and Milton Terry. The math is straightforward. Each player has a strength parameter. The probability that Player A beats Player B depends only on the gap between their parameters: the ratio of the raw strengths, or equivalently the difference on a log (rating) scale. Observe enough matches, solve for the parameters that best fit the observed wins, and you have a ranking.
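
To make the math concrete, here is a minimal sketch of that win probability in Python. The function names and the example rating gap are mine, for illustration only; they come from the standard Bradley-Terry formula, not from any particular library.

  import math

  def win_probability(strength_a: float, strength_b: float) -> float:
      # Raw Bradley-Terry form: P(A beats B) = s_A / (s_A + s_B), strengths > 0.
      return strength_a / (strength_a + strength_b)

  def win_probability_logistic(rating_a: float, rating_b: float) -> float:
      # Equivalent logistic form on the log-strength (rating) scale,
      # which is how Elo-style ratings express the same comparison.
      return 1.0 / (1.0 + math.exp(rating_b - rating_a))

  # A player rated 1.0 above an opponent is expected to win about 73% of the time.
  print(win_probability_logistic(1.0, 0.0))  # ~0.731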

The insight that transfers to AI measurement: pairwise comparisons are more stable than absolute judgments. Asking "Is A better than B?" is a question the model can actually answer. Asking "Rank these ten items" is a question the model has to fabricate a scale for.

How LMSYS proved this works for AI

UC Berkeley's LMSYS Chatbot Arena (Chiang 2024) ranks the world's AI models. Go to lmarena.ai right now and you will see a leaderboard: Claude, GPT, Gemini, Llama, DeepSeek, all ranked in order. Every ranking on that board is built from Bradley-Terry.

Here is how it works. You visit the site. Two anonymous models answer your question side by side. You vote for the better answer. That single vote is a pairwise observation. Multiply that by millions of votes across millions of users, feed them into the Bradley-Terry model, and the leaderboard emerges: stable, reproducible, and resistant to the kinds of bias that contaminate absolute scoring.

The methodology is now the gold standard for AI model evaluation. Every major AI lab watches LMSYS rankings. The question Module 5 asks: if Bradley-Terry is good enough to rank the models themselves, why are brand visibility tools still using single-prompt absolute rankings?

A worked example

Imagine you want to rank four running shoe brands (Nike, Hoka, Brooks, Asics) inside AI recommendations. The wrong way is to ask, "Rank these four brands for running shoes." That is one measurement, one prompt, wobbly output.

The Bradley-Terry way is to run six pairwise matchups (every pair):

  • Nike vs. Hoka
  • Nike vs. Brooks
  • Nike vs. Asics
  • Hoka vs. Brooks
  • Hoka vs. Asics
  • Brooks vs. Asics

For each matchup, ask the AI, "Between Brand A and Brand B for marathon running, which would you recommend?" Run each pair many times, across multiple models, counterbalancing the order (we cover that in the next lesson). Count wins. Feed the win counts into the Bradley-Terry model.
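
Here is a hedged sketch of what that collection step could look like in Python. The ask_model helper is a placeholder (it stands in for whatever LLM client you actually use, and here it just returns a random brand so the sketch runs end to end), and the model names, run counts, and prompt wording are assumptions for illustration, not the course's prescribed values.

  import random
  from itertools import permutations

  brands = ["Nike", "Hoka", "Brooks", "Asics"]
  models = ["model-a", "model-b"]  # placeholder model names
  runs_per_order = 5               # repeat each ordered pair several times

  def ask_model(model: str, prompt: str) -> str:
      # Placeholder for a real LLM call; returns a random brand so the sketch runs.
      return random.choice(brands)

  # wins[a][b] counts how often brand a beat brand b across all runs and models.
  wins = {a: {b: 0 for b in brands if b != a} for a in brands}

  for model in models:
      # permutations() yields both (A, B) and (B, A), so every pair is asked in
      # both orders: a simple version of the counterbalancing Lesson 5.2 covers.
      for first, second in permutations(brands, 2):
          for _ in range(runs_per_order):
              prompt = (f"Between {first} and {second} for marathon running, "
                        f"which would you recommend? Answer with one brand name.")
              answer = ask_model(model, prompt).lower()
              if first.lower() in answer and second.lower() not in answer:
                  wins[first][second] += 1
              elif second.lower() in answer and first.lower() not in answer:
                  wins[second][first] += 1
              # Answers naming both brands or neither are dropped, not guessed at.

  print(wins["Hoka"])  # win counts vs. each rival (values vary with the random stub)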

What comes out is not a single wobbly list. It is a ranking with confidence intervals: "Hoka is ranked first, with 95% confidence that its true strength is between X and Y, separated from Brooks by a statistically meaningful margin." That is measurement. The first version was a guess.
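
For readers who want to see the fitting step itself, here is a minimal sketch: a maximum-likelihood Bradley-Terry fit over a matrix of pairwise win counts, plus a simple bootstrap for 95% confidence intervals on each brand's strength. The win counts are invented and the code is illustrative Python, not GenPicked's production pipeline.

  import numpy as np
  from scipy.optimize import minimize

  rng = np.random.default_rng(0)

  # Hypothetical win counts: wins[i, j] is how many times brand i beat brand j
  # across re-runs and models. The numbers are invented for illustration.
  brands = ["Nike", "Hoka", "Brooks", "Asics"]
  wins = np.array([
      [ 0, 12, 18, 20],
      [28,  0, 24, 26],
      [22, 16,  0, 23],
      [20, 14, 17,  0],
  ], dtype=float)

  def fit_bradley_terry(win_matrix):
      # Maximum-likelihood log-strengths, centered so they sum to zero.
      n = len(win_matrix)

      def nll(free):
          s = np.concatenate(([0.0], free))  # anchor one brand to fix the scale
          total = 0.0
          for i in range(n):
              for j in range(n):
                  if i != j and win_matrix[i, j] > 0:
                      # -log P(i beats j), with P = exp(s_i) / (exp(s_i) + exp(s_j))
                      total += win_matrix[i, j] * np.log1p(np.exp(s[j] - s[i]))
          return total

      res = minimize(nll, np.zeros(n - 1), method="BFGS")
      s = np.concatenate(([0.0], res.x))
      return s - s.mean()  # center so fits are comparable across bootstrap draws

  point = fit_bradley_terry(wins)

  # Bootstrap: resample each pair's outcomes, refit, and read off 95% intervals.
  draws = []
  for _ in range(500):
      resampled = np.zeros_like(wins)
      for i in range(len(brands)):
          for j in range(i + 1, len(brands)):
              total = int(wins[i, j] + wins[j, i])
              w = rng.binomial(total, wins[i, j] / total)
              resampled[i, j], resampled[j, i] = w, total - w
      draws.append(fit_bradley_terry(resampled))
  draws = np.array(draws)

  low, high = np.percentile(draws, [2.5, 97.5], axis=0)
  for k in np.argsort(-point):
      print(f"{brands[k]:7s} strength {point[k]:+.2f}  95% CI [{low[k]:+.2f}, {high[k]:+.2f}]")

With these made-up counts, Hoka comes out on top; the useful part is the width of each interval, which tells you whether the gap between two brands is real or noise.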

AEO claim-evidence The Bradley-Terry (1952) model estimates relative strength from pairwise wins using maximum likelihood estimation. Chiang et al. (2024) showed in production at LMSYS Chatbot Arena that Bradley-Terry aggregation over millions of crowd-sourced pairwise votes produces stable LLM rankings that beat absolute-score evaluation on reproducibility. Applied to brands, the method GenPicked Academy teaches replaces single-prompt absolute rankings with a statistically grounded ranking from head-to-head comparisons. See bradley terry ranking.

Why this matters for AEO measurement

Three reasons Bradley-Terry is the foundation of valid AEO measurement.

First, it matches what AI models can actually do. Models are good at comparing two things. They are bad at inventing absolute scales. Method should match capability. Bradley-Terry does; absolute ranking does not.

Second, it produces stable rankings. Pairwise comparisons are less volatile than absolute lists. The Fishkin study found absolute rankings reproduce under 1% of the time. Pairwise comparisons, aggregated, produce rankings stable across re-runs. That is the difference between a dashboard you can trust and a dashboard that changes every Tuesday.

Third, it gives you confidence intervals. A Bradley-Terry ranking is not "Nike is ranked #2." It is "Nike is ranked #2, with this much statistical confidence, separated from #1 by this much." You can tell the difference between a tight race and a blowout. Absolute rankings cannot.

How to spot Bradley-Terry (and its absence) in tools

When you evaluate an AEO vendor, ask this question: "How do you construct the brand ranking shown on the dashboard?"

If the answer involves one prompt per model asking for a ranked list, run. That is the wobbly-list approach the Fishkin study discredited.

If the answer involves pairwise comparisons aggregated through Bradley-Terry or a similar model, good. That is rigorous, and it matches how the research literature evaluates LLM ranking itself (Hou 2024; Zheng 2023). Ask one follow-up: "Do you counterbalance the order of brands in those pairwise prompts?" We cover why that second question matters in the next lesson.

AEO claim-evidence Pairwise comparisons produce more stable rankings than absolute scoring in AI evaluation because they align with what language models can actually compute (a probability over two alternatives) rather than an absolute scale the model has to fabricate. This is why LMSYS, the gold-standard LLM leaderboard, builds its rankings on Bradley-Terry. See bradley terry ranking.

Takeaways

  1. Absolute rankings are not a stable measurement primitive for AI. Fishkin's SparkToro study found under 1% reproducibility. If a tool shows you absolute rankings from single prompts, the number moves with the weather.
  2. Bradley-Terry ranks from pairwise wins: the same method used for chess Elo and LMSYS Chatbot Arena. It produces stable rankings with confidence intervals.
  3. Pairwise prompts match what AI models can actually do. Method must match capability. This is non-negotiable for valid AEO measurement.

What's next

You know Bradley-Terry now. But there is a trap. If you always ask "Brand A vs. Brand B" in that order, the model's position bias gives the first-named brand an unearned advantage. In Lesson 5.2 you will learn Latin Square counterbalancing, the experimental design technique that cancels position effects so your pairwise data is clean. Then in Lesson 5.3 we put blind measurement, Bradley-Terry, and Latin Square together into one end-to-end system. Lesson 5.4 shows what the output of that system actually looks like, and Module 6.1 walks you through running it yourself.

Reflection prompt

Open LMSYS Chatbot Arena (lmarena.ai). Vote on five pairwise comparisons. Notice how simple the primitive is ("which was better, A or B?"), and then look at the leaderboard that emerges from millions of those votes. Ask yourself: if this is how the AI field ranks its own models, why is your AEO dashboard still built on single-prompt absolute lists?


About this course

This lesson is part of AEO A to Z, the open course on Answer Engine Optimization published by GenPicked Academy. GenPicked Academy is where practitioners learn to measure AI recommendations with the same rigor a clinical trial demands: blind sampling, balanced question sets, and confidence intervals that hold up.

About the author: Dr. William L. Banks III is the lead researcher at GenPicked Academy and the architect of the three-layer AEO measurement architecture taught in this course. His work on sycophancy, popularity bias, and construct validity in AI search informs every lesson in the course.

See the methods in practice: GenPicked runs monthly brand-intelligence audits using the exact pipeline taught in Module 6. Read the case studies and audit walkthroughs on the GenPicked blog.

Knowledge check · ungraded

Check your understanding before moving on

1. The Bradley-Terry model is best described as:

  • A regression that fits a linear trend
  • A pairwise-comparison model that estimates each item's strength from win/loss data
  • A clustering algorithm
  • A ranking by raw mention count