Five Questions to Ask Any AEO Vendor
In this article, you will learn the five diagnostic questions that separate measurement-grade AEO tools from dashboard theatre. For each question, you will learn why it matters, what a good answer sounds like, and how to read evasion as its own signal. By the end, you will be able to run a 20-minute diligence call on any vendor and know whether to trust their numbers.
Where you are in the curriculum
In Lesson 7.1, you built a map of the AEO market: three categories, two sub-dimensions. That map told you where a vendor sits. This lesson tells you whether they belong there. The two lessons work as a pair.
Why five questions, and why these five
Every one of these questions targets a known failure mode in the AEO tool market. They come from the methodology gaps documented across the 27+ commercial tools in the measurement validity crisis: the research pattern that emerged when independent researchers started stress-testing vendor claims.
You do not need to be a statistician to ask them. You only need to listen to how the answer lands. Specific and substantive means the vendor has done the work. Evasive and handwavy means they have not, or they have, and they would prefer you not look.
Question 1: Methodology transparency, "Can you send me your actual prompt template?"
Ask for the literal prompt string. Not a description of it. The string.
A vendor with nothing to hide will send you the prompt the same day. They may redact customer-specific tokens. They will not redact the structure. A vendor who will not share the prompt, even under NDA, is telling you something. Not about their prompt, but about how they think about you as a customer.
Why this matters beyond the obvious: the prompt determines the construct. A prompt that asks "what are the best CRMs for mid-market B2B teams?" measures something. A prompt that asks "what do you know about Acme CRM?" measures something else. They are not the same construct. A tool whose prompt you cannot see is a tool whose construct you cannot verify. That is a construct validity failure: valid measurement, the kind GenPicked Academy teaches, requires defining the construct before generating items (Churchill, 1979), and recent audits find that most LLM benchmarks fail this basic test (Bean, 2024). The tool may be perfectly precise at measuring the wrong thing.
How to read the answer: if the response is "we use proprietary prompt engineering," that is a non-answer. Proprietary means they have intellectual property in how they prompt. It does not mean they cannot share what they prompt. Push once. If they still decline, move on.
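For illustration only, here is a hypothetical template of the kind a transparent vendor might share: customer-specific tokens redacted, structure intact. Every name in it is an assumption for the example, not any vendor's real template.

```python
# Hypothetical prompt template a transparent vendor might share.
# The structure is visible; customer-specific tokens are redacted.
# All names here are illustrative assumptions, not a real vendor's template.

PROMPT_TEMPLATE = (
    "You are helping a {persona} choose software.\n"
    "What are the best {category} for {segment}?\n"
    "List up to {n_results} options and briefly explain each choice."
)

def build_prompt(persona: str, category: str, segment: str, n_results: int = 5) -> str:
    """Fill the template; note that no brand name appears anywhere in it."""
    return PROMPT_TEMPLATE.format(
        persona=persona, category=category, segment=segment, n_results=n_results
    )

print(build_prompt("marketing lead", "CRMs", "mid-market B2B teams"))
```

A template like this is enough to verify the construct: you can see what is asked, what varies, and what never appears.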
Question 2: Blind vs. named, "Does the prompt mention the target brand by name?"
This is the single most important methodology question in the entire AEO category. Most vendors will not volunteer the answer.
A blind prompt asks a category question with no brand names. "What are the best podcast hosting platforms?" The AI answers from its internal model of the market. A named prompt embeds the brand in the query. "Tell me about Acme Podcast Hosting." The AI answers about the thing you asked about, which is a different question.
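A minimal sketch of the two constructions, using hypothetical helper names; the point is that the same brand and category produce two different questions, and therefore two different constructs.

```python
# Sketch: the same (brand, category) pair yields two different questions,
# and therefore two different measurements. Names are illustrative assumptions.

def blind_prompt(category: str) -> str:
    """Category question with no brand names; the model answers from its
    internal picture of the market."""
    return f"What are the best {category}?"

def named_prompt(brand: str) -> str:
    """Brand-anchored question; the model answers about the thing you brought up."""
    return f"Tell me about {brand}."

category = "podcast hosting platforms"
brand = "Acme Podcast Hosting"

print(blind_prompt(category))  # measures organic visibility in the category
print(named_prompt(brand))     # measures willingness to discuss a named brand
```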
Claim-evidence block. Sycophancy (LLMs shifting answers toward user-stated framings) is a documented, systematic effect across major models, traceable to RLHF preference optimization (Sharma, 2024), with converging evaluation evidence of answer drift when users name their referent (Perez, 2023). In the 2026 Brand Intelligence Gap experiment, this surfaced as brand-anchored prompts inflating organic mention rates by 22.5 percentage points, with an odds ratio of 18.5 on mention gain (Banks, 2026). A tool built on named prompts produces scores inflated by a distortion of this magnitude, and the distortion varies across models in ways that do not cancel out when you average.
Why this matters: if a tool uses named prompts, it is not measuring your brand's AI visibility. It is measuring the AI's willingness to talk about something you brought up. Those are different constructs. Named-prompt scores tend to be flattering, which is part of why they sell. Flattering scores are easier to renew.
How to read the answer: "We use both" is the most common dodge. The follow-up is: "When you report my share-of-voice number on the dashboard, which prompt type produced that number?" If they cannot tell you, the dashboard number is an average of two incompatible constructs and is uninterpretable. If they can tell you, and it is blind, that vendor has passed the most important filter in the entire category. See blind vs named measurement.
Question 3: Variance handling, "How many samples per score, and what is the variance?"
AI models are stochastic. The same prompt to the same model at the same temperature returns different outputs on different runs. A single sample is not a measurement; it is an anecdote. A hundred samples with no reported variance is a hundred anecdotes averaged.
Claim-evidence block. Identical prompts yield non-identical answers across repeated calls even at temperature zero, due to sampling and routing variance (Alexander, 2026); trivial prompt formatting changes shift LLM benchmark scores more than the gap between published model versions (Sclar, 2024). SparkToro's 2026 study found fewer than 1 in 100 AI runs produce identical brand lists for the same query (SparkToro, 2026), and SE Ranking found only 9.2% URL repeatability across repeated runs (SE Ranking, 2025). A tool that reports a point estimate without a variance band is reporting the mean of a very wide distribution and hoping you do not ask about the width.
Why this matters: variance is not a footnote. It is the thing that tells you whether a two-point score change is signal or noise. If a competitor's score goes from 40 to 44, you need to know whether the standard deviation on that score is 1 or 12, because the interpretation is opposite in the two cases.
How to read the answer: you want a specific number of samples (30+ is the rough minimum for stable aggregation) and a specific variance figure: a standard deviation, a confidence interval, something. "We run it many times" is not an answer. If the vendor cannot produce the number, they are either not running enough samples or not tracking the variance. Either way, the score is not statistically grounded.
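A minimal sketch of the arithmetic behind this question, with a simulated sampler standing in for real model calls; the function names and numbers are assumptions for the example, not any vendor's pipeline.

```python
import random
import statistics

# Sketch: why a point estimate without a variance band is uninterpretable.
# `sample_mention` stands in for one model run on a blind prompt; here it is
# simulated, in practice it would be an API call plus mention extraction.

def sample_mention(p_mention: float) -> int:
    """Return 1 if the brand appears in this run's answer, else 0 (simulated)."""
    return 1 if random.random() < p_mention else 0

def score_with_interval(p_mention: float, n_samples: int = 30):
    """Mention rate over n_samples runs, with standard deviation and a rough 95% CI."""
    runs = [sample_mention(p_mention) for _ in range(n_samples)]
    rate = statistics.mean(runs)
    sd = statistics.stdev(runs)
    half_width = 1.96 * sd / (n_samples ** 0.5)  # normal-approximation interval
    return rate, sd, (rate - half_width, rate + half_width)

rate, sd, ci = score_with_interval(p_mention=0.4, n_samples=30)
print(f"score={rate:.2f}  sd={sd:.2f}  95% CI=({ci[0]:.2f}, {ci[1]:.2f})")

# A move from 0.40 to 0.44 is only interpretable relative to that interval:
# a narrow band makes it plausible signal, a wide band makes it noise.
```

This is also the arithmetic behind the 40-versus-44 example above: the same four-point move means opposite things depending on the width of that band.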
Question 4: Model coverage, "Which models do you cover, at what frequency, and how do you handle cross-model disagreement?"
The AI-mediated discovery surface is not one thing. ChatGPT, Claude, Gemini, Perplexity, Grok, Mistral, and the smaller model families all behave differently. A tool that covers only one model measures one surface. A tool that covers four but averages them into a composite measures something that does not exist anywhere in reality; the "average model" is a fiction.
Why this matters: cross-model disagreement is first-class information, not noise to be averaged away. When four frontier models agree about your brand's positioning, that agreement is meaningful. When they disagree, that disagreement tells you the brand's positioning has not yet settled across the discovery surface, which is a business insight, not a data problem.
Cross-engine variance is also documented in the wider literature: the same brand question yields different answers across ChatGPT, Claude, Gemini, and Perplexity, with inter-engine variance exceeding intra-engine variance (SparkToro, 2026). The Brand Intelligence Gap research found cross-model sentiment sensitivity varies by 6.7× between the most reactive and least reactive frontier models (Banks, 2026). Averaging outputs from instruments with that sensitivity spread produces a composite score whose error structure is opaque: you cannot tell what the number means.
How to read the answer: ask for per-model outputs, not just the composite. A vendor who surfaces per-model results is serving you information. A vendor who hides per-model results behind a single "AI visibility score" is serving you simplicity at the cost of interpretability. Simplicity is valuable. Opaque simplicity is not.
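A toy illustration with made-up per-model numbers, showing why the composite erases exactly the disagreement you need to see.

```python
import statistics

# Sketch: per-model scores vs. a single composite. The numbers are invented;
# the point is that the composite hides the disagreement that is the insight.

per_model = {
    "chatgpt":    0.62,
    "claude":     0.58,
    "gemini":     0.21,   # one model disagrees sharply
    "perplexity": 0.55,
}

composite = statistics.mean(per_model.values())
spread = max(per_model.values()) - min(per_model.values())

print(f"composite score: {composite:.2f}")   # one tidy number
print(f"per-model spread: {spread:.2f}")     # the disagreement the composite hides
for model, score in per_model.items():
    print(f"  {model:<11} {score:.2f}")
```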
Question 5: Data freshness, "What is the lag between a model update and your dashboard reflecting it?"
Frontier AI models update frequently. Training cutoffs shift. System prompts change. RAG layers get retuned. A tool measuring a moving target has to move with it.
Ask: "When OpenAI pushed the last major GPT update, how long until your dashboard showed the new behavior? And can you show me the before/after in the historical data?"
Why this matters: a vendor with no answer is sampling infrequently enough that they cannot see model updates. A vendor who re-runs the full measurement weekly will have a clean before/after line in their historical data, a visible step change. A vendor who does not have that line is either not sampling often enough or smoothing it out of the presentation layer. Smoothing hides the thing you most need to see: that the underlying instrument changed, which means your longitudinal tracking just got a break in the record.
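For illustration, a minimal sketch of that before/after check, assuming a weekly re-sampled score series and a known update date; the series, the update index, and the numbers are all made up for the example.

```python
import statistics

# Sketch: checking for a step change in a weekly score series around a model
# update. In practice the series comes from the vendor's historical data and
# the split point from the public model-release date.

weekly_scores = [0.41, 0.43, 0.40, 0.42, 0.44, 0.31, 0.29, 0.32, 0.30, 0.28]
update_week = 5  # index of the first week after the model update

before = weekly_scores[:update_week]
after = weekly_scores[update_week:]

step = statistics.mean(after) - statistics.mean(before)
noise = max(statistics.stdev(before), statistics.stdev(after))

print(f"mean before: {statistics.mean(before):.2f}")
print(f"mean after:  {statistics.mean(after):.2f}")
print(f"step of {step:+.2f} vs. within-window noise of about {noise:.2f}")
# A step well outside the noise band means the instrument saw the update,
# and the longitudinal record has a break at that week.
```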
How to read the answer: "We're always up to date" is not an answer; it is a marketing line. "We re-sample the full question set every seven days and version the extraction pipeline" is an answer. Ask to see the versioning log. If they have one, they will share it. If they do not, they will tell you it is proprietary. Treat that as an answer.
What to do with the answers
Count how many of the five questions got an honest, substantive answer, and score the vendor out of 5:
- 5/5: The vendor is doing the work. Proceed with confidence.
- 3-4/5: Mixed. The tool may be useful for specific use cases. Know which questions did not get clean answers, and avoid making decisions that depend on those dimensions.
- 0-2/5: Dashboard theatre. The scores may look impressive. They do not represent what they claim to represent.
Try this
Pick the AEO vendor currently closest at hand, whether you pay for them, a client uses them, or they just sent you a pitch. Send the five questions above in one email. Give them three business days. The quality of the response is itself a measurement.
Key takeaways
- Five questions separate measurement-grade tools from dashboard theatre: prompt transparency, blind vs. named, variance, model coverage, and data freshness.
- Evasion is data. A vendor who will not answer a methodology question is telling you their answer would not help them.
- You do not need statistics training to run this diligence. You need to listen for specific versus handwavy.
What's next
In Lesson 7.3, The FOMO Industrial Complex, you will learn why the market rewards vendors who do not answer these questions. The market dynamics create pressure to sell urgency and reassurance, not rigor. Understanding those dynamics is how you stay clear-eyed while everyone around you is buying.
Reflection prompt: Of the five questions, which one are you least confident you could judge a good answer to? That is the question to practice on your next vendor call.
About this course
This lesson is part of AEO A to Z, the open course on Answer Engine Optimization published by GenPicked Academy. GenPicked Academy is where practitioners learn to measure AI recommendations with the same rigor a clinical trial demands: blind sampling, balanced question sets, and confidence intervals that hold up.
About the author: Dr. William L. Banks III is the lead researcher at GenPicked Academy and the architect of the three-layer AEO measurement architecture taught in this course. His work on sycophancy, popularity bias, and construct validity in AI search informs every lesson you just read.
See the methods in practice: GenPicked runs monthly brand-intelligence audits using the exact pipeline taught in Module 6. Read the case studies and audit walkthroughs on the GenPicked blog.