AEO Tool Selection Criteria: 7 Capabilities That Narrow 27 Vendors to a Short List
In this article, you will learn which 7 capabilities decide whether an AEO platform belongs on your short list, how to weight each capability against your agency or in-house situation, and where the discovery-stage decision ends and the procurement-stage due diligence begins.
The discovery-stage problem
The AI brand visibility market has 27 vendors at last count, led by Profound at a $1B valuation on $155M total funding, AthenaHQ on a $2.2M seed, and a long tail that includes Otterly, Peec AI, Scrunch, Brandlight, and Evertune (Ekamoira, 2026). Every vendor claims they track AI engines. Every vendor claims they deliver insights. The marketing pages converge.
Discovery is not procurement. Procurement is what you do once three or four vendors sit on a short list and you run each through a detailed methodology questionnaire. That procurement-stage work lives in our 13-question vendor due diligence checklist. This article is upstream. The job here is the 7 capability categories that decide who makes the short list in the first place. Weights are defaults. Adjust them to your situation.
The 7 capabilities, weighted
1. Methodology disclosure depth (weight: 25%)
What it is: Whether the vendor publishes how their numbers are produced, including prompt design, engine list, sampling cadence, ranking math, and bias mitigation. Disclosure depth ranges from a marketing page that says "proprietary methodology" to a published methodology document that names every engine, weight, and statistical model.
Why it matters at short-list stage: Disclosure is the cheapest filter you can apply. A vendor who will not disclose at discovery will not disclose at procurement, and the gap becomes a renewal risk when a client asks a question you cannot answer. Full breakdown in our methodology disclosure checklist.
An AEO vendor's willingness to publish its engine list, prompt schema, and ranking math at the discovery stage predicts whether it will survive procurement-stage scrutiny. Profound's $1B valuation has not been accompanied by a published independent methodology audit, a gap the AI visibility market broadly shares (Ekamoira, 2026). For the wider vendor picture, see our AI visibility market landscape article.
Weighting guidance: 25% for agencies serving sophisticated B2B clients. 15% for in-house teams who can negotiate methodology disclosure under NDA after the short list narrows.
2. Engine coverage and disclosed weighting (weight: 15%)
What it is: Which AI engines the platform queries (ChatGPT, Claude, Gemini, Perplexity, Copilot, Grok, Meta AI), how many of them, and whether the vendor discloses how the composite score weights each engine.
Why it matters at short-list stage: A score from one engine is a one-engine ranking, not a market reading. ChatGPT held 68% of AI assistant share in early 2026, Gemini 18.2%, and Perplexity was rising 370% year over year (FirstPage Sage and Similarweb, 2026). Three engines cover roughly 90% of the market. One engine covers a slice.
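If a vendor does disclose per-engine weights, the composite score is simple enough to sanity-check yourself. A minimal sketch in Python, assuming weights proportional to engine market share; the engines, shares, and mention rates below are illustrative placeholders, not vendor data:

```python
# Minimal sketch: a composite visibility score built from per-engine mention
# rates, weighted by each engine's (here: illustrative) market share.
# Every number below is a placeholder, not vendor data.

engine_share = {"chatgpt": 0.68, "gemini": 0.182, "perplexity": 0.05}  # illustrative shares
mention_rate = {"chatgpt": 0.31, "gemini": 0.22, "perplexity": 0.40}   # hypothetical brand mention rates

covered = sum(engine_share.values())              # roughly 0.91 of the assistant market
composite = sum(
    engine_share[e] / covered * mention_rate[e]   # renormalize weights to the covered engines
    for e in engine_share
)
print(f"coverage: {covered:.0%}, composite visibility: {composite:.1%}")
```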
Weighting guidance: 15% baseline. Push to 20% if your client base sits in regions where Gemini, Perplexity, or Copilot have above-average share.
3. Ranking math: pairwise versus absolute (weight: 20%)
What it is: Whether the vendor produces rankings by sorting absolute mention counts (the default) or by running pairwise comparisons aggregated through a statistical model (the LMSYS Chatbot Arena approach). The choice has substantial implications for whether month-over-month movement reflects signal or sampling noise.
Why it matters at short-list stage: Fishkin and O'Donnell ran 2,961 identical prompts in early 2026; fewer than 1% produced the same brand list. Absolute rankings built on data that noisy are unstable. Pairwise math absorbs the noise the same way it does in AI model evaluation. The distinction is large enough to drive short-list decisions, not just procurement-stage ones. Full explanation in our pairwise ranking article.
Pairwise ranking is a statistical method (Bradley-Terry, 1952) that estimates relative strength from many head-to-head comparisons. The same family of methods produces the public AI-model leaderboards cited by frontier AI labs. Applied to AEO, it converts unstable absolute rankings into stable relative ones. See also our Share of Model article.
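A minimal sketch of a Bradley-Terry fit using the standard iterative (minorization-maximization) update; the brand names and head-to-head win counts are hypothetical, not measured data:

```python
# Minimal Bradley-Terry sketch: estimate relative brand strength from
# head-to-head "which brand did the answer favor?" outcomes.
# Brand names and win counts are hypothetical.

# wins[(a, b)] = number of sampled answers in which brand a beat brand b
wins = {("acme", "globex"): 14, ("globex", "acme"): 6,
        ("acme", "initech"): 11, ("initech", "acme"): 9,
        ("globex", "initech"): 8, ("initech", "globex"): 12}

brands = sorted({b for pair in wins for b in pair})
strength = {b: 1.0 for b in brands}

for _ in range(200):  # MM updates converge quickly at this scale
    new = {}
    for i in brands:
        total_wins = sum(w for (a, b), w in wins.items() if a == i)
        denom = 0.0
        for j in brands:
            if j == i:
                continue
            n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
            denom += n_ij / (strength[i] + strength[j])
        new[i] = total_wins / denom if denom else strength[i]
    norm = sum(new.values())
    strength = {b: s / norm for b, s in new.items()}

for b, s in sorted(strength.items(), key=lambda kv: -kv[1]):
    print(f"{b}: {s:.3f}")
```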
Weighting guidance: 20% for sophisticated buyers. 10% for teams who will accept absolute ranking with broader confidence intervals reported alongside.
4. Blind versus named prompt design (weight: 15%)
What it is: Whether the vendor's measurement prompts mention the target brand by name (named, contaminated by sycophancy) or describe the category without naming any specific brand (blind, measures organic visibility). The Banks 2026 sycophancy experiment quantified the gap at +22.5 percentage points of mention inflation when prompts named the target brand (Banks, 2026).
Why it matters at short-list stage: Most platforms use named prompts because they are easy to configure ("show me how Brand X appears in AI"). The output is closer to "did our prompt contain Brand X" than "does AI cite Brand X organically." For competitive intelligence rather than self-reinforcing dashboards, blind prompts are non-negotiable. See blind vs named measurement for mechanics.
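A hedged sketch of how that inflation gets quantified, assuming you can sample answers under both prompt designs; the answer texts and the brand name are placeholders:

```python
# Sketch: mention-rate inflation between blind and named prompt samples.
# The answers and brand below are placeholders; a real run would use
# hundreds of sampled engine answers per condition.
blind_answers = ["Top picks include Globex and Initech...",
                 "Acme and Globex lead the category...",
                 "Consider Initech for mid-market teams..."]
named_answers = ["Acme is a strong option alongside Globex...",
                 "Acme leads the category...",
                 "Acme and Initech both fit..."]

brand = "Acme"

def mention_rate(answers):
    # Share of sampled answers that mention the brand at all.
    return sum(brand.lower() in a.lower() for a in answers) / len(answers)

blind, named = mention_rate(blind_answers), mention_rate(named_answers)
print(f"blind: {blind:.0%}, named: {named:.0%}, "
      f"inflation: {(named - blind) * 100:+.1f} pts")
```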
Weighting guidance: 15% baseline. Raise to 25% if your reporting will be reviewed by clients with research backgrounds.
5. Confidence intervals and uncertainty reporting (weight: 10%)
What it is: Whether the platform reports a single rank number or reports the rank with an uncertainty band around it. A brand at position 4 with overlapping confidence intervals across positions 3 to 5 is genuinely position 3 to 5, and a vendor who hides that ambiguity is selling false precision.
Why it matters at short-list stage: Confidence intervals separate defensible reporting from marketing math. A platform that ships rank positions without uncertainty cannot answer "is this month-over-month movement signal or noise." That question will get asked.
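A minimal sketch of what that uncertainty band is, using a bootstrap over simulated per-answer mention data; the sample is illustrative, not a real measurement:

```python
# Sketch: a bootstrap confidence interval around a mention rate, the kind
# of band a platform should report next to a rank. Data is simulated.
import random

random.seed(42)
answers = [1] * 31 + [0] * 69   # 1 = brand mentioned in a sampled answer (hypothetical)

def bootstrap_ci(data, n_resamples=2000, alpha=0.05):
    # Resample with replacement, collect the mean each time, take percentiles.
    means = sorted(
        sum(random.choices(data, k=len(data))) / len(data)
        for _ in range(n_resamples)
    )
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi

lo, hi = bootstrap_ci(answers)
print(f"mention rate: {sum(answers)/len(answers):.0%}, 95% CI: [{lo:.0%}, {hi:.0%}]")
```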
Weighting guidance: 10% baseline. 15% if your client contracts include performance benchmarks tied to AEO movement.
6. Output format fit: dashboard, API, or report (weight: 8%)
What it is: How the platform delivers its data. Some ship a dashboard with login access. Some ship a flat CSV or API endpoint. Some ship monthly PDF reports. The format decides whether the platform fits an agency's existing client reporting stack or in-house BI tooling.
Why it matters at short-list stage: A workflow decision, not a methodology one. It does not affect whether the numbers are correct. It does affect whether they reach the right people without weekly copy-paste work. An agency running 12 retainers needs API access. An in-house team with one brand may be fine with a dashboard.
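A rough sketch of the difference a flat export makes, assuming a hypothetical CSV schema; the column names are not any vendor's real format:

```python
# Workflow sketch: folding a flat export into per-client reporting without
# copy-paste. The layout below is hypothetical, not a real vendor schema.
import csv
import io
from collections import defaultdict

export = io.StringIO(
    "client,prompt,visibility_score\n"
    "acme,best crm for smb,7.2\n"
    "acme,top crm tools,6.8\n"
    "globex,best erp platforms,5.1\n"
)

by_client = defaultdict(list)
for row in csv.DictReader(export):
    by_client[row["client"]].append(float(row["visibility_score"]))

for client, scores in sorted(by_client.items()):
    print(f"{client}: avg {sum(scores)/len(scores):.1f} across {len(scores)} prompts")
```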
Weighting guidance: 8% baseline. 15% for agencies running more than 10 client accounts. 5% for in-house teams with one brand to monitor.
7. Pricing model fit (weight: 7%)
What it is: Whether the vendor prices per brand, per query, per seat, per workspace, or on enterprise contract. Each model creates different unit economics. Per-brand pricing scales linearly with an agency's client count. Per-query pricing rewards efficient prompt design and punishes sampling depth. Per-seat pricing is friendly to small teams and hostile to enterprise scale.
Why it matters at short-list stage: A pricing model mismatch bleeds slowly. A platform priced per brand at $500 per month is fine for an in-house team and ruinous for an agency adding a 15th client. Filter at discovery so procurement does not surface a deal-breaker after three weeks of demos. AthenaHQ's $2.2M seed sized them for SMB economics. Profound's enterprise pricing reflects a 700-plus customer base weighted toward the Fortune 500 (Profound, 2026). Price encodes the buyer the vendor is built for.
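A back-of-envelope sketch of how per-brand pricing scales with client count; the $500 per-brand figure is the example above, and the flat workspace tier is a hypothetical comparison point:

```python
# Back-of-envelope sketch: per-brand pricing vs a flat workspace tier.
# The workspace figure is purely hypothetical.
per_brand_monthly = 500
flat_workspace_monthly = 2_000   # hypothetical per-workspace tier

for clients in (1, 5, 15):
    per_brand_total = per_brand_monthly * clients
    print(f"{clients:>2} brands: per-brand ${per_brand_total:,}/mo "
          f"vs workspace ${flat_workspace_monthly:,}/mo")
```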
Weighting guidance: 7% baseline. 12% for agencies. 5% for in-house teams.
How the weighting math works
Normalize the weights to 100%, multiply each criterion's score (0 to 10) by its weight, sum across criteria, and rank vendors by total. A spreadsheet works. The goal is not statistical precision. The goal is making the weighting explicit so the short-list decision is reviewable later. Agency weights favor output format and pricing model. In-house weights favor blind prompts and engine coverage. Same criteria, situational weights.
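A minimal sketch of that spreadsheet in code, using this article's default weights and hypothetical vendor scores:

```python
# Sketch of the scoring spreadsheet. Weights are this article's defaults;
# vendor names and 0-10 scores are placeholders.
weights = {
    "methodology_disclosure": 25, "engine_coverage": 15, "ranking_math": 20,
    "blind_prompts": 15, "confidence_intervals": 10, "output_format": 8,
    "pricing_fit": 7,
}
total_weight = sum(weights.values())   # normalize in case you adjusted the defaults

vendors = {   # hypothetical 0-10 scores per criterion
    "vendor_a": {"methodology_disclosure": 8, "engine_coverage": 7, "ranking_math": 9,
                 "blind_prompts": 8, "confidence_intervals": 6, "output_format": 5,
                 "pricing_fit": 6},
    "vendor_b": {"methodology_disclosure": 4, "engine_coverage": 9, "ranking_math": 5,
                 "blind_prompts": 3, "confidence_intervals": 4, "output_format": 9,
                 "pricing_fit": 8},
}

totals = {
    name: sum(scores[c] * w for c, w in weights.items()) / total_weight
    for name, scores in vendors.items()
}
for name, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {total:.2f} / 10")
```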
What this framework deliberately excludes
This is a discovery-stage checklist. Three things sit outside its scope. Procurement-stage methodology and contract questions live in the 13-question vendor due diligence checklist. Deep methodology audits live in the methodology transparency article and the methodology disclosure checklist. Specific head-to-head vendor comparisons are not here because the SERP already runs heavy on listicles, and listicle rankings do not survive sophisticated procurement review.
What the short list should look like
A defensible short list is three to four vendors, each scoring above 6 of 10 on the heavily weighted criteria (methodology disclosure, ranking math, blind prompts), at least adequate on the medium-weight ones (engine coverage, confidence intervals), and viable on workflow (output format, pricing). Any vendor scoring below 6 on methodology disclosure should not advance. A vendor who clears that screen but falls short on ranking math or blind prompts may advance, with the gap flagged for procurement-stage probing.
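A short sketch of that screen expressed as a decision rule, again with hypothetical scores:

```python
# Sketch of the short-list screen: a hard gate on methodology disclosure,
# soft flags on ranking math and blind prompts. Scores are hypothetical.
candidates = {
    "vendor_a": {"methodology_disclosure": 8, "ranking_math": 9, "blind_prompts": 5},
    "vendor_b": {"methodology_disclosure": 4, "ranking_math": 7, "blind_prompts": 8},
}

def screen(scores):
    if scores["methodology_disclosure"] < 6:
        return "drop: fails the methodology screen"
    flags = [c for c in ("ranking_math", "blind_prompts") if scores[c] < 6]
    return f"advance, flag for procurement: {', '.join(flags)}" if flags else "advance"

for name, scores in candidates.items():
    print(name, "->", screen(scores))
```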
GenPicked publishes its methodology, uses pairwise ranking, runs blind prompts as the dominant signal, and discloses engine weighting. Those are positions, not marketing claims. Apply the framework to GenPicked the same way you apply it to any other vendor.
Frequently asked questions
What is the difference between AEO tool selection criteria and AEO vendor due diligence?
Selection criteria are the seven capability categories that narrow a universe of 27 vendors to a short list of 3 or 4. Vendor due diligence is the 13 detailed methodology and contract questions you ask each short-listed vendor before signing. Selection criteria are discovery-stage. Due diligence is procurement-stage. The two checklists live in separate articles because they answer separate decisions.
Should I use a listicle-style "top 10 AEO tools" article to choose a vendor?
Listicles work for awareness, not for buying decisions. The ranking in a listicle reflects the author's editorial preferences, the affiliate relationships of the publisher, or both. A criteria framework lets you apply your own weights to your own situation and produce a short list that is defensible internally. The criteria above are designed to do that.
How many AEO vendors should I shortlist?
Three or four. Fewer than three risks a procurement process that pressures you into the one vendor who made the cut. More than four creates demo fatigue and dilutes the procurement-stage methodology questioning. The sweet spot is three named vendors plus one "wild card" who scored well on a specific high-weight criterion you care about.
Do agencies and in-house teams use different criteria?
The criteria are the same. The weights differ. Agencies care more about output format fit (API access matters when you serve 12 clients), pricing model (per-brand vs per-seat changes the unit economics), and methodology disclosure (clients ask questions). In-house teams care more about blind prompts (the data is for internal decisions, not client reports) and engine coverage (you need market-wide reading, not portfolio reading).
What is the single most important criterion?
Methodology disclosure depth. A vendor who will not disclose how their numbers are produced cannot survive sophisticated procurement, cannot defend renewal under client pushback, and cannot help you when month-over-month movement gets questioned. Every other criterion matters. This one filters.
How long does a discovery-stage short list take to build?
Two to four hours of focused work, assuming you have already skimmed the marketing pages of candidate vendors. Procurement-stage due diligence then takes two to four weeks because it involves vendor demos, methodology document review, and contract negotiation.
Related reading
- 13-question AEO vendor due diligence methodology checklist
- AEO tool methodology disclosure checklist
- Why most AEO tools will not show you their engine weights
- Pairwise ranking for AEO measurement
- Share of Model: the AEO metric everyone wants
The buyer journey from here
Next stop on the buyer journey is the procurement-stage work: the 13-question vendor due diligence checklist. After due diligence, the deepest vendor evaluation question becomes methodology disclosure, covered in the methodology disclosure article.
If you want to see what a vendor that scores well across all seven criteria looks like in practice, run a free GenPicked AEO audit on your brand or a client's brand. We will run the scan with disclosed methodology, blind prompts, pairwise ranking, and confidence intervals reported alongside.
Start your 14-day free trial of GenPicked Growth
Dr. William L. Banks III is Founder of GenPicked. Citations to Profound (2026), Ekamoira (2026), Fishkin and O'Donnell (2026), Banks (2026), and the AI market share data from FirstPage Sage and Similarweb are documented in the GenPicked research wiki. Specific source files available on request.