Designing Your Question Set
In this lesson, you will learn:
- How to design a question set that produces diagnostic AEO data
- The four question types that matter
- The phrasing rules that keep your prompts clean
- How many questions you actually need (fewer than you think)
- How to avoid the single most common rookie mistake: leading prompts that contaminate the measurement before you even run it
This is Lesson 6.2 of Module 6. Your measurement environment is built. Now you design the prompts you will run through it. The quality of your audit is largely determined here. Good questions produce signal. Sloppy questions produce noise that looks like signal, the most dangerous failure mode in AEO work.
We'll design a question set end-to-end for one real category, so you can see how the method GenPicked Academy teaches works in practice.
Why question design is the bottleneck
A common pattern among first-time AEO auditors: they rush through prompt design to get to the "real" work of running the audit. Then their data looks weird, their findings are hedged, and they blame the models. Almost always, the problem is upstream. The questions were leading, ambiguous, or mixed multiple signals, and no amount of careful running fixes that.
The analogy is survey research. A well-designed survey with 30 respondents produces more reliable findings than a sloppy one with 3,000. The instrument is the leverage point. Churchill's (1979) measurement paradigm (define the construct, specify the domain, generate the items, then purify) is the scaffold: in AEO, your question set is the item-generation step, and no running discipline downstream compensates for skipping it. The question set is your instrument.
The instrument beats the sample size. You will feel pressure to ask "more questions" to get "more data." Resist. Ten clean questions beat thirty contaminated ones. The audit's credibility lives in the prompt design, not the question count.
Part 1: The four question types
You will write questions in four categories. Each tests a different facet of brand visibility. See Blind vs. Named Measurement for the underlying methodology.
1. Blind questions
Definition: Questions that do NOT name the target brand. They probe category-level visibility.
Example: "What are the best fitness wearables for serious athletes?"
What they measure: Organic brand visibility. If the target brand appears in the response to a blind question, the model has genuinely associated that brand with the category. This is the gold-standard signal.
Why they matter: Blind questions are the only question type that measures what buyers actually experience when they ask AI for a category recommendation without having a specific brand in mind. That scenario, the category-first buyer, is the high-leverage moment for AEO. Blind mention rate is therefore the most important single metric in your audit.
2. Named questions
Definition: Questions that DO name the target brand, often as one of several options or as an explicit anchor.
Example: "What are the best fitness wearables like Oura Ring?" or "Is Oura Ring a good choice for sleep tracking?"
What they measure: Sycophancy-contaminated mention rate. The brand will almost always appear, because the prompt anchors the model to it. Perez et al. (2023) documented this directly: LLMs shift their answers toward user-stated views in evaluations, so any prompt that names a brand positively acts as a soft endorsement the model will mirror back.
Why you still run them: Named mention rate by itself is useless for measuring visibility. But the gap between named and blind is diagnostic. If your brand appears 95% of the time when named and 20% of the time when unnamed, you have a 75-point sycophancy gap, and you now know that your measured "visibility" in any named-prompt tool is mostly an artifact of prompt wording. See Sycophancy for why this happens.
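To make that arithmetic concrete, here is a minimal sketch (in Python, with made-up counts) of how the sycophancy gap falls out of blind and named mention rates. The specific numbers are illustrative placeholders, not audit data.

```python
# Minimal sketch of the sycophancy-gap arithmetic. The mention counts below
# are illustrative placeholders, not real audit results.

def mention_rate(mentions: int, runs: int) -> float:
    """Share of runs in which the target brand appeared, as a percentage."""
    return 100.0 * mentions / runs

blind_rate = mention_rate(mentions=12, runs=60)   # e.g. 20% blind mention rate
named_rate = mention_rate(mentions=57, runs=60)   # e.g. 95% named mention rate

sycophancy_gap = named_rate - blind_rate          # the 75-point gap from the example
print(f"Blind: {blind_rate:.0f}%  Named: {named_rate:.0f}%  Gap: {sycophancy_gap:.0f} points")
```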
3. Comparison questions
Definition: Pairwise or tripartite prompts that force the model to compare the brand against specific competitors.
Example: "Compare Oura Ring, Whoop, and Apple Watch for sleep tracking."
What they measure: Relative framing. Does the model frame the brand as a leader, a competitor, or an also-ran when forced to compare it side-by-side?
Why they matter: Comparison questions are the most common B2B buyer query type. "Salesforce vs. HubSpot." "Notion vs. Roam." Buyers consistently ask AI for head-to-head comparisons. How your brand is positioned in those comparisons (the adjectives, the framing, the order) shapes the buying decision. This is where AEO meets sales enablement.
4. Adversarial questions
Definition: Reputation-probing prompts that invite the model to surface concerns, risks, or downsides.
Example: "What are the concerns with Oura Ring?" or "Why might someone avoid Oura Ring?"
What they measure: Reputation drift in the training corpus. If the model readily surfaces five concerns, those concerns exist somewhere in the web data the model was trained on: user forums, Reddit, product reviews, comparison articles. This is useful defensive intelligence.
Why use sparingly: Adversarial prompts skew negative by design. Over-sampling them produces a misleading picture. Use 1-2 adversarial questions per audit, not 5-10. They exist to surface known concerns for the report's "what to watch" section, not to rank the brand.
Part 2: The phrasing rules
Bad phrasing contaminates your data. Here are the rules that separate a clean question set from a noisy one.
Rule 1: No leading prompts
A leading prompt telegraphs the answer you want. The model, being helpful, provides it. Your "finding" is then indistinguishable from the prompt.
Bad: "Why is Oura Ring considered the best sleep tracker?" Good: "What are the best sleep trackers?"
The first question contains the answer inside the question. The second lets the model tell you what it actually knows.
Bad: "What are the top AI brand visibility tools like Profound, Peec AI, and Athena?" Good: "What are the top AI brand visibility tools?"
The first question names three brands in the prompt. The model will almost certainly include them in the answer, not because they're the best, but because you named them. You just measured your own prompt. This is the practical reason eval-writers like Perez et al. (2023) go to such lengths to construct non-leading prompts: a prompt that smuggles in the answer produces data that reflects the prompt, not the model.
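If you keep your questions in a file, a few lines of code can catch this mistake mechanically. The sketch below is an illustrative heuristic, not part of any prescribed tooling; the brand list is a placeholder you would swap for your own category.

```python
# Illustrative heuristic: flag "blind" questions that accidentally name the
# target brand or a competitor. A brand name inside a blind prompt turns it
# into a leading prompt. Brand names here are placeholders for this sketch.

TARGET_BRAND = "Oura"
COMPETITORS = {"Whoop", "Apple Watch", "Garmin", "Fitbit"}

def leading_terms(question: str) -> list[str]:
    """Return any brand names found inside a supposedly blind question."""
    return [b for b in {TARGET_BRAND, *COMPETITORS} if b.lower() in question.lower()]

blind_questions = [
    "What are the best sleep trackers?",                         # clean
    "What are the top fitness wearables like Oura and Whoop?",   # leading
]

for q in blind_questions:
    hits = leading_terms(q)
    print(f"[{'LEADING' if hits else 'ok'}] {q} {hits or ''}")
```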
Rule 2: Match buyer language, not category jargon
Buyers don't ask AI for "conversion rate optimization platforms." They ask, "how do I get more people to buy from my website?" If your question uses industry jargon that buyers don't use, you're measuring something buyers would never encounter.
Jargon-heavy (bad): "What are the leading Answer Engine Optimization platforms?"
Buyer-aligned (good): "How do I make sure AI tools like ChatGPT recommend my company?"
Both questions are probing the same space. The second matches what an actual CMO types into Claude on a Tuesday afternoon. Use the second form.
Rule 3: Avoid temporal and scope ambiguity
"Best" is vague. "Best in 2026" is specific. "Best for a Series A SaaS with 20 employees" is diagnostic.
Vague (bad): "What are the best CRMs?"
Specific (good): "What are the best CRMs for an early-stage B2B SaaS startup in 2026?"
The second version limits the response space. The model's answer is more comparable across models because the scope is anchored.
Rule 4: One question, one signal
Mixing multiple intents in one question produces answers you cannot score cleanly.
Mixed (bad): "What are the best fitness wearables, and why is Oura a good choice?"
Decomposed (good):
- Question 1 (blind): "What are the best fitness wearables?"
- Question 2 (named): "Is Oura a good choice among fitness wearables?"
Two clean signals beat one muddled one.
Rule 5: Write the question the way a human would type it
AI models are trained on human prose. Awkward or robotic phrasing produces awkward, less reliable responses. Write questions the way you would type them into a search bar or speak them to a knowledgeable friend.
Robotic (bad): "Enumerate fitness wearable brands for athletic populations."
Human (good): "What fitness wearables do serious athletes actually use?"
The second is how your buyer would ask. Match that register.
Part 3: How many questions do you need?
Fewer than you think.
The minimum viable question set
For a first audit of a single brand in a single category:
- 5 blind questions
- 5 named questions (paired with the blind questions, same category, same intent, but with the brand named)
- 3 comparison questions (target brand vs. top 3 competitors, one pairwise prompt each)
- 2 adversarial questions (concerns / risks / downsides)
That's 15 questions. Multiplied by four models, that's 60 total runs. Multiplied by two sampling passes (same questions a few days apart, to observe volatility), that's 120 responses.
At roughly one minute per response, plus logging time, budget four to five hours for the execution phase in Lesson 6.3. Plan accordingly.
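As a sanity check on that budget, here is the same arithmetic as a short script. The per-response and logging minutes are assumptions; adjust them to your own pace.

```python
# Back-of-envelope sizing for the execution phase, using the counts above.
# The per-response minutes are assumptions, not measured values.

questions = 5 + 5 + 3 + 2        # blind + named + comparison + adversarial = 15
models = 4
passes = 2                       # two sampling passes a few days apart

runs_per_pass = questions * models          # 60
responses = runs_per_pass * passes          # 120

minutes_querying = 1.0                      # rough per-response query time
minutes_logging = 1.5                       # assumed logging overhead per response

total_hours = responses * (minutes_querying + minutes_logging) / 60
print(f"{responses} responses, roughly {total_hours:.1f} hours of execution")   # ~5.0
```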
Why not more
More questions feel more rigorous. In practice, they often produce worse analysis, because you run out of time to log carefully, and late-session responses are recorded with less precision than early-session ones. Diminishing returns set in fast.
If you want more data, run the same 15 questions on a second category, or re-run the set next month. Longitudinal data is more valuable than breadth in a single session.
The paired structure
Your blind and named questions should be paired: each blind question has a matching named version that probes the same underlying intent. This is what makes the sycophancy gap calculable. Example pair:
- Blind: "What are the best fitness wearables for serious athletes?"
- Named: "What are the best fitness wearables for serious athletes, for example, is Oura a good option?"
Same category, same buyer intent, same time horizon. The only variable is whether the target brand is named. That's the experimental design that makes sycophancy measurable. See Non-Uniform Distortion for why paired design matters for multiple metrics, not just mention rate.
AEO claim, paired prompt design: Sharma et al. (2024) established that sycophancy is a systematic RLHF-driven behavior that only emerges cleanly under paired-prompt comparison; Banks (2026) applied that paired design to brand measurement across 864 observations and found sycophancy distorts mentions, rank, and sentiment in different directions simultaneously, a multi-metric distortion that unpaired designs cannot detect. Pairing is not optional if you want to catch the full distortion pattern.
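One way to keep the pairing honest is to store each blind/named pair as a single record and assert the naming rule before running anything. The structure below is a minimal sketch under my own assumptions, not a prescribed schema; the brand check is specific to the Oura example.

```python
# Minimal sketch of a paired blind/named record plus a pre-run sanity check.
# The dataclass fields and the brand string are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class QuestionPair:
    intent: str   # shared buyer intent, e.g. "sleep/recovery for serious athletes"
    blind: str    # category question, brand NOT named
    named: str    # same intent, target brand named

pairs = [
    QuestionPair(
        intent="sleep/recovery for serious athletes",
        blind="What are the best fitness wearables for serious athletes?",
        named="What are the best fitness wearables for serious athletes? For example, is Oura a good option?",
    ),
    # ... remaining pairs follow the same pattern
]

# The only variable within a pair should be whether the brand is named.
for p in pairs:
    assert "oura" not in p.blind.lower(), f"Blind question leaks the brand: {p.blind}"
    assert "oura" in p.named.lower(), f"Named question omits the brand: {p.named}"
```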
Part 4: Worked example, fitness wearables for Oura Ring
Let's design a real question set together. Target brand: Oura Ring. Category: fitness wearables. Direct competitors: Whoop, Apple Watch, Garmin, Fitbit.
Category context paragraph
Fitness wearables are a consumer market dominated by Apple Watch (general-audience smartwatch), with specialist entrants like Oura Ring (sleep and recovery focus), Whoop (strap-based recovery monitoring), Garmin (serious athletes and endurance), and Fitbit (Google-owned mainstream fitness). Buyers typically come at the category with one of three intents: general fitness tracking, sleep and recovery optimization, or serious endurance training. Oura's positioning is strongest in the sleep/recovery intent. We are probing how AI models surface Oura relative to competitors across all three intents.
Blind questions (5)
- Q-blind-01: What are the best fitness wearables for someone who cares most about sleep tracking?
- Q-blind-02: What are the best fitness trackers for everyday health monitoring in 2026?
- Q-blind-03: I want a wearable that tracks my recovery and readiness, not just my workouts. What should I look at?
- Q-blind-04: What are the most accurate fitness wearables currently on the market?
- Q-blind-05: If I care about battery life and don't need a screen, what fitness wearables should I consider?
Notice: none of these name Oura. Each one probes a buyer intent where Oura could plausibly appear: sleep, recovery, accuracy, screenless form factor.
Named questions (5, paired)
- Q-named-01: What are the best fitness wearables for someone who cares most about sleep tracking? Is Oura a good option?
- Q-named-02: Among fitness trackers like Oura, Whoop, and Apple Watch, which is best for everyday health monitoring in 2026?
- Q-named-03: For tracking recovery and readiness, is Oura Ring a serious option?
- Q-named-04: How accurate is the Oura Ring compared to other fitness wearables on the market?
- Q-named-05: Is Oura Ring a good choice if I want long battery life and don't need a screen?
Notice: each named question pairs cleanly with a blind question. Same category, same intent. Only the brand-naming varies.
Comparison questions (3)
- Q-comp-01: Compare Oura Ring and Whoop for recovery tracking. Which is better, and why?
- Q-comp-02: If I'm choosing between Apple Watch and Oura Ring for general health tracking, what are the tradeoffs?
- Q-comp-03: Oura Ring vs. Garmin, which is better for an endurance athlete?
Notice: each comparison pits Oura against a specific competitor with a specific use case. Generic "compare all fitness wearables" prompts produce mush; scoped comparisons produce diagnostic framing.
Adversarial questions (2)
- Q-adv-01: What are the main concerns or complaints about Oura Ring?
- Q-adv-02: Why might someone choose a different fitness wearable instead of Oura Ring?
These surface reputation signal. Useful for the "what to watch" section of the audit report, not for ranking.
Why this question set works
It's 15 questions. It probes the three high-value buyer intents (sleep, general health, endurance). It pairs blind and named cleanly. It scopes comparisons to specific competitor-use-case combinations. It includes defensive reputation questions without over-weighting them.
You could swap Oura for any target brand and adapt this structure directly. That's the point: the structure generalizes.
AEO claim, buyer-intent coverage: Fishkin (2026) analyzed 2,961 AI-model prompts and found cited-source repeatability under 1% across identical prompts, with major variance concentrated in prompts that lacked specific buyer-intent scoping. Intent-scoped questions ("for sleep tracking," "for endurance athletes") produce more stable and more interpretable data than generic category prompts.
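If you want to verify intent coverage rather than eyeball it, one option is to tag each question with a buyer intent and check the set against the intents you care about. The tags below are my own illustrative labels for the worked example, not part of the question IDs themselves.

```python
# Sketch of an intent-coverage check. The intent tags are illustrative labels
# for the worked example above, not part of any prescribed schema.

INTENTS = {"sleep/recovery", "general health", "endurance"}

question_intents = {
    "Q-blind-01": "sleep/recovery",
    "Q-blind-02": "general health",
    "Q-blind-03": "sleep/recovery",
    "Q-blind-04": "general health",
    "Q-blind-05": "general health",
    "Q-comp-03":  "endurance",
    # ... remaining IDs omitted for brevity
}

missing = INTENTS - set(question_intents.values())
print("Missing intents:", sorted(missing) if missing else "none")
```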
Part 5: Common mistakes and how to fix them
Mistake 1: All your blind questions are too general
Symptom: every model returns the same top-three brands to every question.
Fix: add buyer-intent specificity. Not "best fitness wearable" but "best fitness wearable for serious runners training for a marathon."
Mistake 2: Your named questions are actually leading questions
Symptom: the model praises the brand in every named response.
Fix: frame the brand as one option among several, or ask genuinely open questions ("is Oura a good choice for X?"). Don't ask "why is Oura the best?"
Mistake 3: Your comparison questions include too many brands
Symptom: the model produces a flat list without framing.
Fix: keep comparisons to 2-3 brands maximum. "Oura vs. Whoop" produces more diagnostic framing than "Oura vs. Whoop vs. Apple Watch vs. Garmin vs. Fitbit."
Mistake 4: Your adversarial questions dominate the set
Symptom: the audit looks like a hit piece.
Fix: cap adversarial at 2 questions (out of 15). They are spice, not entree.
Mistake 5: You never run the set by a second reader
Symptom: a bias in your phrasing that you couldn't see yourself.
Fix: share your question set with one colleague before running it. Ask them to flag anything that reads as leading. Ten minutes of peer review saves four hours of wasted audit time.
Exercise, design your own question set
Pick the brand you chose in Lesson 6.1. Using the worked example as a template:
- Write your category context paragraph (100-200 words).
- Write 5 blind questions.
- Write 5 named questions, each paired to a blind question.
- Write 3 comparison questions (brand vs. top 3 competitors).
- Write 2 adversarial questions.
- Save the full set in your prompt library file, with stable IDs (Q-blind-01 through Q-adv-02). One possible file layout is sketched after this list.
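Here is one possible layout for that prompt library file, sketched as JSON via Python's standard library. The file name, field names, and the "pair" linkage are assumptions, not a required format; any structure with stable IDs works.

```python
# One possible prompt-library layout with stable IDs. File name, field names,
# and the "pair" linkage are assumptions for this sketch, not a required format.

import json

prompt_library = {
    "brand": "YOUR BRAND",
    "category": "YOUR CATEGORY",
    "questions": [
        {"id": "Q-blind-01", "type": "blind",       "pair": "Q-named-01", "text": "..."},
        {"id": "Q-named-01", "type": "named",       "pair": "Q-blind-01", "text": "..."},
        {"id": "Q-comp-01",  "type": "comparison",  "pair": None,         "text": "..."},
        {"id": "Q-adv-01",   "type": "adversarial", "pair": None,         "text": "..."},
        # ... remaining questions follow the same pattern, through Q-adv-02
    ],
}

with open("prompt_library.json", "w") as f:
    json.dump(prompt_library, f, indent=2)
```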
Then run the peer-review check: read each question aloud, as if a friend were asking you an honest question they genuinely didn't know the answer to. If any question sounds like you already know the answer you want, rewrite it.
Takeaways
- Question design is the bottleneck, not question count. Fifteen clean questions produce better data than fifty sloppy ones. Budget your rigor accordingly.
- Pair blind and named. Without pairing, you cannot calculate the sycophancy gap, the most diagnostic single metric in AEO measurement.
- Scope by buyer intent, not by category alone. "Best wearable" is too general. "Best wearable for sleep tracking" produces data you can actually interpret.
What's next
With your question set designed, you are ready to run the audit. Lesson 6.3, Running the Audit Across Four Models, walks through execution: how to query cleanly, how to implement Latin Square counterbalancing in practice, and how to handle the edge cases (refusals, hallucinations, hedged answers) you will absolutely encounter.
Reflection prompt
Look at the question set you just wrote. Which question are you most uncertain about, the one you suspect might be leading, or might mix intents? Write a sentence in your notebook explaining your concern. Rewriting that question now is cheaper than rerunning the audit later.
About this course
This lesson is part of AEO A to Z, the open course on Answer Engine Optimization published by GenPicked Academy. GenPicked Academy is where practitioners learn to measure AI recommendations with the same rigor a clinical trial demands: blind sampling, balanced question sets, and confidence intervals that hold up.
About the author: Dr. William L. Banks III is the lead researcher at GenPicked Academy and the architect of the three-layer AEO measurement architecture taught in this course. His work on sycophancy, popularity bias, and construct validity in AI search informs every lesson you just read.
See the methods in practice: GenPicked runs monthly brand-intelligence audits using the exact pipeline taught in Module 6. Read the case studies and audit walkthroughs on the GenPicked blog.