Prompt Sampling for AI Brand Measurement: The Sample Design Playbook
In this article, you will learn why prompt sampling for AI brand measurement decides what an AEO score actually means, which five diversity dimensions a defensible prompt set covers, how sample size differs from sample design, and how a five-question buyer audit surfaces sample design from a vendor's methodology page.
Two vendors, the same sample size, different scores
Two AEO vendors both claim 200 prompts per scan. You run the same brand through each. The scores disagree by 30 points. The first instinct is that one vendor is wrong and the other is right. The truer answer is that prompt sampling for AI brand measurement is more than a number. Sample size addresses how much variance there is around an estimate. Sample design addresses what the estimate is even measuring. Two vendors with the same sample size and different sample designs are measuring different things, and the scores will differ accordingly.
GenPicked publishes both the sample size and the sample design on every methodology page. The five diversity dimensions below are the same axes we audit our own prompt set against, and the same axes a buyer can use to audit any vendor.
This article is the deep dive on the second methodology question from the five tests an agency owner can run in 30 minutes. The companion piece on reproducibility covers how stable scores stay across days. This one covers what the prompts behind those scores actually look like, and how to tell defensible sample design from a generic one.
The under-disclosed half of the methodology question
Most AEO methodology pages disclose sample size. Far fewer disclose sample design. The implication a buyer is invited to draw is that 200 prompts is 200 prompts regardless of construction. The peer-reviewed measurement literature says the opposite.
Sample design is the under-disclosed half of methodology. Sclar and colleagues at ICLR 2024 showed that minor format changes to a prompt, including spacing, delimiter choice, and capitalization, produced up to a 76 percentage point swing in benchmark accuracy (Sclar et al., ICLR 2024). A vendor disclosing sample size without disclosing prompt structure is reporting numbers that are partly a function of the template choice.
Sample size answers a statistical question. With more queries per period, the confidence interval around the reported score gets tighter. Sample design answers a construct question. With the right prompt mix, the score corresponds to brand presence in AI outputs. With the wrong mix, the same score corresponds to whatever the prompt template happens to amplify. Both questions matter. A defensible vendor answers both.
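The variance half of that question can be made concrete. The sketch below uses the standard normal approximation for a binomial proportion to show how the margin of error around a mention rate tightens as prompt count grows; it is an illustrative statistics sketch, not any vendor's disclosed method.

```python
import math

def mention_rate_margin(mentions: int, prompts: int, z: float = 1.96) -> float:
    """95% margin of error (normal approximation) for a mention-rate estimate."""
    p = mentions / prompts
    return z * math.sqrt(p * (1 - p) / prompts)

# Same observed 40 percent mention rate, three sample sizes.
for n in (50, 200, 1000):
    margin = mention_rate_margin(int(0.4 * n), n)
    print(f"n={n}: 40% +/- {margin * 100:.1f} points")
# n=50:   40% +/- 13.6 points
# n=200:  40% +/- 6.8 points
# n=1000: 40% +/- 3.0 points
```

The narrowing interval is all that a bigger sample buys. Nothing in the formula says anything about which prompts produced the 40 percent, which is the design question.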
Sample size alone does not fix construct ambiguity. SE Ranking's 2025 AI Mode study ran 10,000 queries three times on the same day and found 9.2 percent same-day URL consistency across the runs (SE Ranking, 2025). A scan can be large and still report a number whose meaning depends on which prompts were sampled. Size addresses variance. Design addresses what the score actually measures.
The five diversity dimensions of a defensible prompt set
A prompt set is defensible when it covers the buyer-relevant variation in how a real customer might query AI about the category. Five dimensions matter.
Dimension 1: Intent
A real buyer might ask AI an informational question (what is AEO measurement), a commercial-investigation question (which AEO platforms work for boutique agencies), a navigational question (find the GenPicked pricing page), or a transactional question (start an AEO audit today). A prompt set that covers only one intent type is measuring brand presence inside that intent only. Defensible sample design samples across intents with documented proportions.
Dimension 2: Specificity
Queries vary from broad to long-tail. Broad queries (best CRM software) sit at the top of the funnel and surface a few dominant brands. Long-tail queries (best CRM software for two-person law firms in Florida) sit deeper and surface different brands. A scan biased toward broad queries reports a head-of-funnel score. A scan biased toward long-tail reports a niche score. The construct definition determines which mix is right, and the methodology page should name it.
Dimension 3: Format
Format is the dimension Sclar and colleagues showed is most likely to swing the result. Question format, command format, scenario format, list format, comparison format, and chat-history format produce different responses from the same engine. A defensible prompt set documents which formats are included and in what proportions.
Dimension 4: Persona
Real queries come from different buyer personas. A CMO might ask "what AEO platform should we evaluate this quarter." An agency owner might ask "which AEO tool fits a 12-person boutique." A learner might ask "what is share of model." Each persona's queries surface different brands. A persona-blind scan reports an unweighted average across personas the vendor never disclosed.
Dimension 5: Language
Lay language and technical language produce different responses. "Brand visibility in AI" and "AEO measurement methodology" overlap conceptually but trigger different model behaviors. A prompt set written entirely in technical jargon measures brand presence among technical-language buyers. A defensible set covers both registers in documented proportion.
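In code terms, documented proportions amount to stratified sampling: every prompt in the pool is tagged along each dimension, and the scan draws per a published quota rather than uniformly at random. A minimal sketch of one dimension follows; the intent tags and proportions are illustrative, not GenPicked's actual mix, and a real pool would also tag specificity, format, persona, and language.

```python
import random

# Illustrative quota: documented intent proportions for a 200-prompt scan.
INTENT_QUOTA = {"commercial": 0.40, "informational": 0.30,
                "navigational": 0.20, "transactional": 0.10}

def stratified_sample(pool, quota, n, key="intent", seed=7):
    """Draw n prompts from pool, honoring the documented per-stratum shares."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    sample = []
    for stratum, share in quota.items():
        candidates = [p for p in pool if p[key] == stratum]
        sample.extend(rng.sample(candidates, round(n * share)))
    return sample

# Tiny tagged pool for illustration only.
pool = [{"text": f"prompt {i}", "intent": intent}
        for intent in INTENT_QUOTA
        for i in range(100)]
scan = stratified_sample(pool, INTENT_QUOTA, n=200)
print(len(scan))  # 200 prompts, 80/60/40/20 by intent
```

The point of the sketch is the quota dictionary itself: a vendor that can publish it has disclosed the design, and a vendor that cannot has only disclosed the size.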
Category-level versus brand-level prompts
The most consequential sample design choice is whether prompts name the target brand. Two valid choices exist, and they measure different things.
Category-level prompts ask the engine about the category or use case without naming any brand. The buyer observes whether the target brand appears in the response. This is the construct-aligned design when the question is "does this brand show up in AI when buyers ask about the category."
Brand-level prompts name the target brand in the query. They produce mention rates inflated by the model's tendency to echo back the input. Brand-level prompts have legitimate diagnostic uses (sentiment, attribute extraction, citation depth around a known brand), but they are not measuring brand presence in the same sense category-level prompts are.
Brand-anchored prompts inflate the score. The 864-observation paired-prompt experiment found a 22.5 percentage point mention-rate inflation when the brand was pre-supplied in the query versus a category-level prompt (blind vs named measurement). A defensible sample design defaults to category-first prompts and labels brand-anchored queries separately. A vendor reporting one composite score from a mixed set is reporting a weighted average the buyer cannot interpret.
The mechanism is documented in the sycophancy deep dive. The implication for sample design is direct: a methodology page that does not specify which prompts are category-first and which are brand-anchored has not disclosed the most consequential sample design choice.
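The reporting rule can be expressed directly: compute the headline score from category-level prompts only, and surface brand-anchored results as a separately labeled diagnostic rather than folding them into one composite. The sketch below assumes result records tagged with a prompt type and a mention flag; the field names are illustrative, not a real scan schema.

```python
def score_report(results):
    """Headline score from category-level prompts; brand-anchored kept separate."""
    def rate(kind):
        rows = [r for r in results if r["prompt_type"] == kind]
        return sum(r["mentioned"] for r in rows) / len(rows) if rows else None

    return {
        "headline_visibility": rate("category"),    # the reportable number
        "brand_anchored_diagnostic": rate("brand"), # labeled, never blended
    }

# Toy data: brand-anchored prompts echo the brand back far more often.
results = (
    [{"prompt_type": "category", "mentioned": m} for m in [1, 0, 0, 1, 0]]
    + [{"prompt_type": "brand", "mentioned": m} for m in [1, 1, 1, 0, 1]]
)
report = score_report(results)
print(report)  # category rate 0.4, brand-anchored rate 0.8
```

Averaging the two pools into one score would report 0.6, a number that corresponds to neither construct, which is exactly the uninterpretable weighted average the section above describes.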
The buyer audit for sample design
Open the vendor's methodology page and run it against the five-question audit below. A vendor passing all five has disclosed a defensible sample design. A vendor passing three has made a partial disclosure. A vendor passing zero is selling a sample size without a sample design.
Audit question 1. Does the methodology page specify the intent distribution of the prompt set? Acceptable disclosure names approximate proportions (e.g., 40 percent commercial investigation, 30 percent informational, 20 percent navigational, 10 percent transactional). Unacceptable disclosure says "we cover all query types" without naming proportions.
Audit question 2. Does the methodology page specify the specificity distribution? Acceptable disclosure names the head-to-tail ratio (e.g., 30 percent broad head, 50 percent mid-tail, 20 percent long-tail). Unacceptable disclosure does not name the ratio.
Audit question 3. Does the methodology page specify the format distribution? Acceptable disclosure lists the formats used (question, command, scenario, comparison) and approximate proportions. Unacceptable disclosure says "we use natural language queries."
Audit question 4. Does the methodology page specify which prompts are category-level versus brand-level, and how the two are combined in the reported score? Acceptable disclosure names a default (category-level for the headline score) and a separate diagnostic label for brand-level. Unacceptable disclosure does not distinguish.
Audit question 5. Does the methodology page specify how the prompt set is versioned and updated? Acceptable disclosure names a versioning discipline (monthly review, quarterly refresh, change log). Unacceptable disclosure says nothing about how prompts evolve.
A defensible AEO methodology answers all five in writing. A defensible buyer asks for the document, not the verbal summary.
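The five questions reduce to a checklist a buyer can score mechanically. In the sketch below, the boolean answers come from reading the methodology page by hand; nothing here parses a page automatically, and the label for intermediate pass counts is an illustrative addition.

```python
AUDIT_QUESTIONS = [
    "intent distribution disclosed with proportions",
    "specificity (head-to-tail) ratio disclosed",
    "format mix disclosed with proportions",
    "category-vs-brand split and combination rule disclosed",
    "prompt-set versioning discipline disclosed",
]

def audit_verdict(answers):
    """Map a list of five pass/fail answers to a verdict."""
    assert len(answers) == len(AUDIT_QUESTIONS)
    passed = sum(answers)
    if passed == 5:
        return "defensible sample design"
    if passed >= 3:
        return "partial disclosure"
    if passed == 0:
        return "sample size without a sample design"
    # Counts of 1-2 are not named above; labeled here for completeness.
    return "weak disclosure"

# Example: a vendor disclosing intent, format, and versioning only.
print(audit_verdict([True, False, True, False, True]))  # partial disclosure
```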
What defensible sample design looks like in practice
A defensible sample design covers five diversity dimensions in documented proportion. A prompt set that varies only on one dimension reports a number tied to that dimension. A prompt set that covers intent, specificity, format, persona, and language reports a number tied to the construct (construct validity). The five-dimension audit is the buyer's tool for verifying a vendor's sample design before the renewal decision.
GenPicked's prompt set documents the five dimensions and publishes the version history on the methodology page. The default headline score uses category-level prompts. Brand-level prompts run as separate diagnostic scans labeled as such, with the inflation noted in the report.
The longer companion on methodology transparency covers the broader vendor disclosure standard. The piece on reproducibility covers how stable scores stay across days once sample design is fixed. The piece on construct validity covers the 50-year-old measurement theory that turned all five dimensions into testable claims.
Frequently asked questions
What is the difference between sample size and sample design in AI brand measurement?
Sample size is how many prompts the scan runs per period. Sample design is which prompts the scan uses and how they are distributed across intent, specificity, format, persona, and language. Size addresses statistical variance around a score. Design addresses what the score actually measures. Both matter. A vendor disclosing only size has answered half the methodology question.
How many prompts per period is enough for a defensible AEO score?
The right answer depends on the confidence interval the buyer needs and the sample design. A diverse 200-prompt set covering the five dimensions can produce a more defensible headline score than a 10,000-prompt set that varies on only one dimension. Ask for the disclosed confidence interval and the disclosed prompt mix together, not the size in isolation.
Should AEO scans use category-level or brand-level prompts?
Default to category-level for the headline visibility score. Category-level prompts ask about the category or use case and observe whether the brand appears, which is what brand presence in AI actually means. Brand-level prompts are useful for sentiment, attribute extraction, and citation depth, but they should run as separate diagnostic scans labeled distinctly, with the inflation effect noted.
What is prompt diversity and why does it matter for AEO?
Prompt diversity is the variation across the prompt set on intent, specificity, format, persona, and language. It matters because a score derived from a narrow prompt set reflects performance on that narrow slice, not brand presence overall. A scan biased toward broad commercial queries reports head-of-funnel visibility. A scan biased toward long-tail informational queries reports a different number. A defensible set covers the buyer-relevant distribution.
How do I audit a vendor's prompt set without seeing the prompts themselves?
Read the methodology page. A defensible methodology discloses the intent distribution, the specificity distribution, the format mix, the category-versus-brand split, and the versioning discipline. A vendor calling the prompt set "proprietary" has refused to disclose the design. Treat that refusal as the answer, not as a temporary delay.
What is the most consequential prompt sampling decision a vendor makes?
The category-level versus brand-level default. A vendor naming the brand in the query is reporting a mention rate inflated by more than 20 percentage points. No statistical correction recovers the real signal after the fact. This single design choice is the largest source of score divergence between two vendors with the same sample size.
How often should an AEO vendor refresh the prompt set?
A defensible versioning discipline reviews the prompt set monthly and refreshes it quarterly or when the underlying AI engines change behavior. The methodology page should publish the version history so a buyer can tell whether a score movement reflects brand performance or a prompt set update.
See what disclosed sample design looks like
The GenPicked sample brief at genpicked.com/demo includes the prompt set diversity table for one brand, broken down by intent, specificity, format, persona, and language. The 14-day free trial includes the full methodology page with the sample design audit answered in writing.
Dr. William L. Banks III is Co-Founder of GenPicked. References documented in the GenPicked research wiki. Specific citations available on request.