Running the Audit Across Four Models
In this lesson, you will learn:
- How to cleanly execute the audit GenPicked Academy teaches
- The query protocol that keeps your data comparable across models
- How to implement Latin Square counterbalancing in a working session
- How to record responses without contaminating them
- How to handle the three edge cases you will almost certainly encounter (the model refuses, the model hallucinates, the model hedges)
This is Lesson 6.3. Your environment is built. Your question set is designed. Now you run the thing. Execution is where the audit either becomes real data or becomes a drawer full of screenshots. The difference is almost entirely protocol.
Budget four to five hours for a full 15-question, four-model run. Do it in one or two sittings, not across a week. Consistency of execution matters.
The case for a written protocol
An AEO audit is a small experiment. In any experiment, protocol is what lets you compare conditions. If you query ChatGPT in a fresh chat but query Claude in the middle of an existing conversation, your data is no longer comparable: the Claude responses are conditioned on the conversation history, the ChatGPT responses are not.
The protocol in this lesson is simple, but it has to be followed exactly. Every shortcut you take introduces a confound that will show up in analysis as a "surprising finding" that is actually an artifact.
Your future self will thank you. When you look at your data in Lesson 6.4 and see a pattern, you want to be able to trust it. The only way to trust it is to know your protocol was clean. Shortcut now, doubt everything later.
Module 1, The universal query protocol
Every query you run, in every model, follows the same five-step protocol.
Step 1, Open a fresh chat
Every question starts in a new conversation. Not continued. Not in a sidebar chat with prior context. A fresh chat.
Why: AI models condition on conversation history. A model that has already discussed Oura Ring in an earlier message will mention Oura more readily in subsequent messages. That's contamination: the prior turns act as a named-prompt anchor, even if the current turn is worded blind.
How: in each UI, look for "New chat" (ChatGPT, Claude, and Gemini all use this label) or the + icon (Perplexity). One fresh chat per question, every time. Yes, that's 60 fresh chats for a 15-question audit across four models. Do it anyway.
Step 2, Paste the question verbatim
Copy the question from your prompt library. Paste it into the chat. Do not retype it. Do not "improve" it in the moment. Do not add a lead-in like "Hi, I'm curious about..."
Why: if every question is phrased slightly differently, you have no reliable comparison across models. The prompt is the instrument, and the instrument must be identical. Sclar et al. (2024) showed that trivial prompt formatting changes (capitalization, punctuation, whitespace) shift LLM benchmark scores by more than the gap between published model versions. Treat the pasted prompt as sacred.
Step 3, Let the model respond fully
Wait for the full response. Do not interrupt. Do not ask a follow-up in the same chat. The audit measures first-turn responses only.
Why: follow-ups contaminate. The model's second-turn behavior is different from its first-turn behavior. For this audit, we measure first-turn only, which is also what most buyers actually see: most people ask the AI one question and take the first answer.
Step 4, Log immediately
Within five seconds of reading the response, log the row in your spreadsheet:
- brand_mentioned (Y/N)
- position (if mentioned)
- sentiment (positive / neutral / negative / mixed)
- sentiment_rationale (one phrase)
- response_snippet (copy the relevant 1-3 sentences verbatim)
- competitors_mentioned (comma-separated)
- refusal_flag (Y if refused; blank otherwise)
- hallucination_flag (Y if model invented a fact; blank otherwise)
This is the five-second rule from Lesson 6.1. In execution, you will be tempted to batch the logging. Don't. Log as you go.
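If you prefer to log straight to a CSV file rather than a spreadsheet UI, a minimal sketch of one row might look like the following. Field names follow the schema above; session_id, question_id, and model are assumed columns, inferred from how the rest of this lesson refers to them.

```python
# A minimal row-logging sketch, assuming a CSV file instead of a spreadsheet UI.
# session_id, question_id, and model are assumed columns inferred from this lesson.
import csv
import os
from dataclasses import dataclass, asdict, fields

@dataclass
class AuditRow:
    session_id: str                  # e.g., "audit-001" for the first pass
    question_id: str                 # e.g., "Q-blind-01"
    model: str                       # "chatgpt", "claude", "gemini", "perplexity"
    order_variant: str = ""          # "A" or "B" for comparison questions
    brand_mentioned: str = "N"       # Y / N
    position: str = ""               # 1 = first brand named; blank if not mentioned
    sentiment: str = ""              # positive / neutral / negative / mixed
    sentiment_rationale: str = ""    # the one phrase that drove the call
    response_snippet: str = ""       # 1-3 sentences, copied verbatim
    competitors_mentioned: str = ""  # comma-separated
    refusal_flag: str = ""           # Y if the model refused; blank otherwise
    hallucination_flag: str = ""     # Y if the model invented a fact; blank otherwise
    notes: str = ""                  # "hedged", hallucination correction, etc.

def append_row(path: str, row: AuditRow) -> None:
    """Append one logged response, writing the header if the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(AuditRow)])
        if write_header:
            writer.writeheader()
        writer.writerow(asdict(row))
```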
Step 5, Close or archive the chat
Close the chat (or archive it in ChatGPT's interface). Open a fresh one for the next question. Reset the state.
The four-model loop
For each question in your set, run Steps 1-5 in all four models before moving to the next question:
Question Q-blind-01 → ChatGPT → Claude → Gemini → Perplexity
Question Q-blind-02 → ChatGPT → Claude → Gemini → Perplexity
Question Q-blind-03 → ChatGPT → Claude → Gemini → Perplexity
... and so on
Alternative (equally valid): run one model all the way through before moving to the next. Either order works. Pick one and stick with it.
Why consistent ordering matters: if you run ChatGPT first for half your questions and last for the other half, any drift in your logging discipline (fatigue, attention, care with sentiment calls) gets confounded with the ChatGPT data. Pick a pattern and stick to it.
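If a checklist helps you hold the pattern, here is a small sketch of the question-major ordering described above; the question IDs and model labels are illustrative placeholders.

```python
# A sketch of the question-major run order: every question goes through all four
# models before moving to the next question. Swap the loops for model-major order.
MODELS = ["ChatGPT", "Claude", "Gemini", "Perplexity"]
QUESTIONS = [f"Q-blind-{i:02d}" for i in range(1, 4)]  # illustrative subset

def run_order(questions, models, question_major=True):
    """Yield (question_id, model) pairs in one fixed, repeatable order."""
    outer, inner = (questions, models) if question_major else (models, questions)
    for a in outer:
        for b in inner:
            yield (a, b) if question_major else (b, a)

for question_id, model in run_order(QUESTIONS, MODELS):
    print(f"{question_id} -> {model}: fresh chat, paste verbatim, wait, log, close")
```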
Module 2, Latin Square counterbalancing, in practice
Latin Square counterbalancing is how you cancel out order effects. In comparison questions especially, the order in which you list competitors affects the model's response. "Oura vs. Whoop" produces different framing than "Whoop vs. Oura." This is not anecdote: Craswell et al. (2008) established foundational position-bias models showing that higher-ranked options receive disproportionate attention independent of their relevance, and Chatbot Arena (Chiang et al., 2024), the largest deployed pairwise LLM evaluation, uses Bradley-Terry scoring over counterbalanced pairs precisely because unordered pairwise data contains ineradicable position effects. See Latin Square Counterbalancing for the underlying method.
In a full-scale research study, you would use a formal Latin Square design with many more conditions. For a practitioner audit, the practical implementation is simpler: run each comparison question in both orders, across all four models.
The practical implementation
Take your comparison question Q-comp-01:
- Order A: "Compare Oura Ring and Whoop for recovery tracking."
- Order B: "Compare Whoop and Oura Ring for recovery tracking."
Run both orders in all four models. That's 8 runs per comparison question, or 24 runs for three comparison questions, all part of the 120-run budget.
In your spreadsheet, add an order_variant column (or use the session_id to track): A for the original order, B for the reversed order. When you analyze the data in Lesson 6.4, you will average across orders to neutralize position bias.
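A minimal sketch of generating both order variants from a prompt-library entry, using the Oura/Whoop example above; the template wording is an assumption, so substitute your exact prompt-library phrasing.

```python
# A minimal sketch of generating both order variants for one comparison question.
# The template string is an illustrative assumption, not your exact prompt wording.
def comparison_variants(brand_a: str, brand_b: str, template: str) -> dict:
    """Return the A (original) and B (reversed) order variants of one comparison prompt."""
    return {
        "A": template.format(first=brand_a, second=brand_b),
        "B": template.format(first=brand_b, second=brand_a),
    }

variants = comparison_variants(
    "Oura Ring", "Whoop",
    "Compare {first} and {second} for recovery tracking.",
)
# variants["A"] -> "Compare Oura Ring and Whoop for recovery tracking."
# variants["B"] -> "Compare Whoop and Oura Ring for recovery tracking."
```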
When you can't run both orders
If time is genuinely tight, you can skip the order reversal for blind and named questions (which don't have a natural "order" of competitors anyway) and only counterbalance the comparison questions. That cuts the budget from 120 runs to roughly 90. That's the minimum acceptable audit.
Do not skip the counterbalancing on comparison questions. The order effect there is well-documented and large enough to flip rank positions.
AEO claim, position bias in comparisons: Craswell et al. (2008) first formalized position bias in ranking, showing that higher-listed options get clicked at rates inconsistent with their objective relevance. Banks (2026) replicated the effect inside LLM outputs: across 864 paired observations, brand order in comparison prompts shifted the first-mentioned brand's rank by an average of 0.83 positions toward the top slot. Comparison questions run in only one order produce measurements that conflate position bias with genuine preference.
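When you reach analysis, the order-averaging step itself is small. A sketch, assuming you have recorded your brand's position under each order variant for one question/model pair:

```python
# A sketch of averaging a brand's position across order variants A and B to
# neutralize position bias. The input format is an illustrative assumption.
def order_neutral_position(position_by_variant: dict) -> float:
    """Average a brand's position across the order variants it was observed in."""
    return sum(position_by_variant.values()) / len(position_by_variant)

# Listed first in order A (position 1) but third in order B (position 3):
print(order_neutral_position({"A": 1, "B": 3}))  # 2.0, the order-neutral estimate
```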
Module 3, Recording responses without contaminating them
Logging feels mechanical, but it's where contamination sneaks in. Three rules keep it clean.
Rule 1, Copy snippets verbatim, never paraphrase
Your response_snippet should contain the model's actual words for the 1-3 sentences that matter most to the audit finding. Not your summary. The actual words.
Why: your summary embeds your interpretation. The verbatim text preserves the evidence. When you write the audit report in Lesson 6.5, you will quote these snippets directly. Paraphrases are unquotable. They also drift over time: your paraphrase today is not the same as your paraphrase in six months.
Rule 2, Score sentiment in the moment, with rationale
Sentiment is a judgment call. Make the call when the response is fresh, within five seconds of reading it, and write down the phrase that drove the call.
Good entries:
- positive — "industry-leading accuracy"
- mixed — "solid but limited"
- negative — "overpriced for the feature set"
Bad entries:
- positive (no rationale, unreproducible)
- mixed — the model said some good and some bad things (vague, you'll forget what)
The rationale is what makes your sentiment scoring auditable. When you re-read the row in six months, you know exactly why you scored it the way you did.
Rule 3, Score position strictly
position = 1 means the brand was the first brand named in the response. position = 2 means second-named. And so on. If the brand appears multiple times, record the earliest position.
Position is a proxy for the "de facto recommendation" effect: buyers pay disproportionate attention to the first-mentioned option. A brand mentioned in position 5 of a list of 7 is effectively invisible, even though brand_mentioned = Y.
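If you want a sanity check on your manual position calls, here is a sketch of strict position scoring. The brand list and plain substring matching are simplifying assumptions; you still score by reading the response.

```python
# A sketch of strict position scoring: position 1 means the brand was the first
# brand named in the response. Substring matching is a simplifying assumption.
def first_mention_positions(response: str, brands: list) -> dict:
    """Map each mentioned brand to its order of first appearance (1 = first-named)."""
    text = response.lower()
    first_index = {b: text.find(b.lower()) for b in brands if b.lower() in text}
    ranked = sorted(first_index, key=first_index.get)
    return {brand: rank + 1 for rank, brand in enumerate(ranked)}

positions = first_mention_positions(
    "Whoop and Oura Ring both track recovery; Garmin is another option.",
    ["Oura Ring", "Whoop", "Garmin", "Fitbit"],
)
print(positions)  # {'Whoop': 1, 'Oura Ring': 2, 'Garmin': 3}; Fitbit gets no position
```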
Module 4, Edge cases: the three you will hit
You will run into weird responses. Every first-time auditor does. Here's how to handle the three most common.
Edge case 1, The model refuses or hedges
Symptom: the model says something like "I can't recommend a specific brand without knowing more about your needs," or "I don't have enough information to give a specific recommendation."
Examples in the wild: Claude is the most likely to do this, especially on named prompts it interprets as requests for commercial endorsement. Gemini does it occasionally on questions about competitive positioning.
How to handle it:
1. Set refusal_flag = Y in your spreadsheet.
2. Leave brand_mentioned = N (the brand wasn't mentioned; the model didn't commit).
3. Leave position blank.
4. Score sentiment as neutral and note the rationale: neutral — refused to recommend.
5. Copy the refusal text verbatim into response_snippet.
6. Do NOT re-ask the question with different phrasing to "get an answer." That contaminates the session and is no longer the same audit.
Why: refusals are data. The rate at which a model refuses to recommend is itself a finding. Claude refusing 30% of named questions tells you something important about Claude's alignment layer and the usefulness of Claude-based AEO tools.
Edge case 2, The model hallucinates
Symptom: the model invents a product, a feature, or a fact. Example: "The Oura Ring 5 (released in late 2025) introduced continuous glucose monitoring", when no such product exists.
How to handle it:
1. Set hallucination_flag = Y.
2. Still log brand_mentioned, position, and sentiment; the hallucination doesn't change the fact that the brand was mentioned.
3. In notes, write what was hallucinated and what is actually true. Example: Invented Oura Ring 5 with CGM; actual current product is Oura Ring 4 without CGM.
4. Copy the hallucinated passage verbatim into response_snippet.
Why: hallucinations are also data. A model that hallucinates positively about your brand is producing risky visibility: if a buyer acts on the hallucinated feature, your support team pays the cost. A model that hallucinates negatively (inventing a concern that doesn't exist) is actively damaging your reputation. Track these.
Edge case 3, The hedged answer
Symptom: the model provides an answer, but wraps it in so many hedges that it's unclear whether the brand was actually recommended. Example: "Some athletes use Oura Ring, though it's not universally recommended, and there are other options."
How to handle it:
1. brand_mentioned = Y (the brand was named).
2. position = the position of the first mention.
3. sentiment = mixed, with rationale: mixed — hedged recommendation.
4. In notes, write "hedged" so you can cluster hedged responses in analysis.
Why: hedged recommendations are functionally weak mentions. In analysis, you may find that one model hedges 60% of its "mentions" of your brand while another never hedges. That's a diagnostic pattern worth a whole section of your audit report.
AEO claim, cross-model hedging variance: models with stronger alignment layers tend to produce more hedged and refused recommendations on brand-named prompts. Sharma et al. (2024) traced this reactivity to RLHF preference optimization: models trained for helpfulness learn to hedge when the user's framing invites disagreement. Banks (2026) then quantified the spread, showing Claude was 6.7x more reactive to prompt framing than GPT-5, with much of that reactivity expressed as sentiment hedging rather than binary mention/no-mention shifts. Hedge rate is therefore a first-class audit metric, not a footnote.
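Applied to the AuditRow sketch from Module 1, the three edge-case rules reduce to a few field assignments. A sketch follows; the helper names are illustrative, and the values mirror the handling steps above.

```python
# A sketch of the three edge-case logging rules, applied to the AuditRow sketch
# from Module 1. Helper names are illustrative; values follow this module's steps.
def log_refusal(row, refusal_text: str) -> None:
    row.refusal_flag = "Y"
    row.brand_mentioned = "N"
    row.position = ""                      # left blank: the model didn't commit
    row.sentiment = "neutral"
    row.sentiment_rationale = "refused to recommend"
    row.response_snippet = refusal_text    # copied verbatim

def log_hallucination(row, correction: str) -> None:
    row.hallucination_flag = "Y"           # mention, position, sentiment stay as scored
    row.notes = correction                 # what was invented, and what is actually true

def log_hedge(row, position: int, snippet: str) -> None:
    row.brand_mentioned = "Y"
    row.position = str(position)
    row.sentiment = "mixed"
    row.sentiment_rationale = "hedged recommendation"
    row.response_snippet = snippet
    row.notes = "hedged"                   # so hedged rows can be clustered in analysis
```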
Module 5, The second pass
Once you complete the first run of all 15 questions across all four models, you are not done. You need a second pass to observe volatility.
Why a second pass
AI model outputs are not stable. The same question asked today will produce a different answer next week: new top-K sampling draws, new training updates, new retrieval hits. Alexander (2026) showed that identical prompts yield non-identical answers across repeated calls even at temperature zero, due to sampling and routing variance; Fishkin (2026) documented the same volatility in brand-recommendation prompts specifically. See AI Recommendation Consistency for the research on how unstable this actually is.
Without a second pass, you have a snapshot. With a second pass, you have a range. A range is much more useful in a report: you can say "Oura appeared in 60-80% of ChatGPT blind responses over a two-day sampling window" rather than "Oura appeared in 60% of responses on April 4."
How to run the second pass
Wait at least 48 hours after your first pass. Ideally a week. Then run the exact same 15 questions, in all four models, following the same protocol. Log the results in your spreadsheet, using a different session_id (e.g., audit-002).
The gap between passes lets you observe:
- Stability: does the brand appear consistently, or does its mention rate swing 20+ points?
- Drift: are the competitors being named changing over time?
- Position volatility: does the brand's position in lists move?
One pass tells you "here's the current state." Two passes tell you "here's the current state and here's how stable it is." Stability is what buyers actually need to know.
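Once both sessions are logged, the range calculation is straightforward. A sketch, assuming the CSV layout from the Module 1 sketch:

```python
# A sketch of turning two passes into a range: mention rate per (model, session),
# read from the audit CSV assumed in the Module 1 sketch.
import csv
from collections import defaultdict

def mention_rates(path: str) -> dict:
    """Return {(model, session_id): fraction of rows where brand_mentioned == 'Y'}."""
    counts = defaultdict(lambda: [0, 0])          # (model, session) -> [mentions, total]
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            key = (row["model"], row["session_id"])
            counts[key][0] += row["brand_mentioned"].strip().upper() == "Y"
            counts[key][1] += 1
    return {k: mentions / total for k, (mentions, total) in counts.items()}

# Comparing audit-001 against audit-002 per model gives the range you report:
# "Oura appeared in 60-80% of ChatGPT blind responses over the sampling window."
```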
Exercise, run your audit
This is the execution phase. Using the question set you designed in Lesson 6.2:
- Block a 4-5 hour window (or two 2-hour sessions).
- Run the universal query protocol for all 15 questions across all four models, first pass.
- Log every response in real time, using your spreadsheet schema.
- Note any edge cases in your research notebook.
- Wait 48 hours minimum.
- Run the second pass.
When you finish, your spreadsheet should have roughly 120 rows. Your notebook should have two or three paragraphs of observations per session, patterns you noticed, edge cases you encountered, hypotheses that emerged.
You now have raw data. Lesson 6.4 turns it into metrics.
Common execution mistakes
"I ran the questions in a single long conversation"
The responses are contaminated by conversation history. You'll need to re-run with fresh chats. This is the most common rookie mistake.
"I rephrased a question mid-audit because the model seemed confused"
Your data for that question is no longer comparable across models. Either re-run all four models with the new phrasing, or restore the original phrasing and treat the confused response as data.
"I logged responses in a batch at the end"
Your sentiment calls are less reliable than they would have been with real-time logging. Flag this as a limitation in your report's methodology section.
"I only ran ChatGPT because I ran out of time"
That's not an AEO audit. That's a ChatGPT audit. Budget the full four-model scope or narrow the question set, but don't narrow the model set.
"I got curious and ran follow-up questions in the same chat"
Those follow-ups are useful for context but not for measurement. Record the first-turn response only. Note any interesting follow-up behavior in your notebook, not your spreadsheet.
Takeaways
- Protocol beats improvisation. Fresh chat, verbatim prompt, full response, immediate logging, close chat. Every question, every time.
- Counterbalance your comparisons. Run comparison questions in both orders. The order effect is large enough to flip rankings if you don't.
- Edge cases are data. Refusals, hallucinations, and hedges are not noise; they are diagnostic. Flag them and analyze them.
What's next
You now have raw audit data, roughly 120 response rows across two sessions. Lesson 6.4, Calculating Your Diagnostic Metrics, walks through the math. Pairwise win-rates, Bradley-Terry scores, variance calculations, confidence bands. All in your spreadsheet, no statistics software required.
Reflection prompt
Before moving on: look at your two audit sessions side by side. Which finding surprised you? Which finding did you expect and see confirmed? Write a paragraph in your notebook. The surprise is where your analysis should focus most in Lesson 6.4. The expected finding is where you should pressure-test whether the data actually supports what you thought it would.
Templates referenced: Audit Spreadsheet (template forthcoming). Use the Lesson 6.1 schema as the authoritative reference for now.
About this course
This lesson is part of AEO A to Z, the open course on Answer Engine Optimization published by GenPicked Academy. GenPicked Academy is where practitioners learn to measure AI recommendations with the same rigor a clinical trial demands: blind sampling, balanced question sets, and confidence intervals that hold up.
About the author: Dr. William L. Banks III is the lead researcher at GenPicked Academy and the architect of the three-layer AEO measurement architecture taught in this course. His work on sycophancy, popularity bias, and construct validity in AI search informs every lesson you just read.
See the methods in practice: GenPicked runs monthly brand-intelligence audits using the exact pipeline taught in Module 6. Read the case studies and audit walkthroughs on the GenPicked blog.