Setting Up Your Measurement Environment
In this lesson, you will learn how to set up everything you need to run a credible AEO audit: accounts on the four major AI models, a spreadsheet built for diagnostic analysis, a prompt library that stays organized across weeks of work, and logging conventions that turn one audit into a body of evidence.
This is Lesson 6.1 of Module 6, the hands-on module. You've spent five modules learning the theory. Now you build the workbench. Everything that follows in Module 6 assumes the environment we set up here. Get this right, and the rest of the audit GenPicked Academy teaches is mechanical.
Why the environment matters more than the tools
Most people who attempt an AEO audit do it wrong, not because they picked the wrong AI model, but because they didn't set up a measurement environment that lets them tell signal from noise. They open ChatGPT, ask a few questions, notice the brand did or didn't appear, and write up a vibe.
That's not an audit. That's a screenshot collection.
An audit is a structured comparison: same questions, same models, same recording format, repeated over time. The environment is what makes that possible. Measurement theorists have known this for decades: Churchill (1979) showed that valid measurement requires construct definition, domain specification, item generation, and consistent instrumentation; skip any step and the score becomes unreliable. Your environment is the instrumentation step. Without it, you can't answer the diagnostic questions that matter: does the brand appear in one model more than another? Does the mention rate change when you run the same question a week later? Does the blind version of a question produce a different answer than the named version?
The discipline-tools ratio
The single biggest predictor of whether your audit produces useful findings is not which tools you have. It is how consistently you log what you see. A free-tier account plus a disciplined spreadsheet beats a paid tool stack plus ad-hoc notes, every time.
Part 1: The four accounts you need
You will audit across four models. These are the frontier models for the AEO Strategist. Each answers category and brand questions differently, and the differences are diagnostic. See Model Susceptibility Spectrum for why this cross-model comparison is non-negotiable.
ChatGPT (OpenAI)
Go to chat.openai.com. Create a free account, or upgrade to Plus if you want access to the full GPT-5 model and browsing. For audit purposes, a free account is enough to start; you can upgrade later when your workflow demands it.
What to verify after signup:
- You can start a new chat
- You can copy a response to your clipboard cleanly
- You can return to previous chats in the sidebar (OpenAI calls this "History")
Claude (Anthropic)
Go to claude.ai. Create an account. Free tier gives you access to Claude Sonnet; Pro gives you higher message limits and access to the more capable Claude Opus model.
Claude is the most sycophancy-reactive of the four models: 6.7x more reactive than GPT-5 in the Banks (2026) experiment, consistent with the broader cross-model sycophancy pattern Sharma et al. (2024) documented across frontier LLMs. That makes Claude the most instructive model for this audit. You want to see sycophancy for yourself, and Claude shows it in its clearest form.
Gemini (Google)
Go to gemini.google.com. Sign in with a Google account. The free tier gives you Gemini 2.5 Flash. Gemini's grounding in Google Search makes it behave differently from the other three: it pulls in live web results more aggressively, which means its answers are more volatile day to day.
Perplexity
Go to perplexity.ai. Create an account. Perplexity is a hybrid: it retrieves live web results and synthesizes them with an LLM. That makes it structurally different from ChatGPT, Claude, and Gemini, which answer more from trained knowledge. In an audit, Perplexity often surfaces the brand differently because it is pulling fresh citations rather than reaching into a training corpus.
Why these four, and not others
The AI search market has many more models: You.com, DeepSeek, Grok, Mistral, and others. You will encounter practitioners who audit across eight or twelve. That's fine. For your first audit, four is enough. It covers the three major Western frontier labs (OpenAI, Anthropic, Google) plus one hybrid (Perplexity). That span is wide enough to expose cross-model variance, the single most diagnostic signal in AEO measurement.
AEO claim, why four models: Different frontier models produce dramatically different sycophancy magnitudes under identical prompts. Sharma et al. (2024) documented systematic sycophancy across every major LLM tested, traced to RLHF preference optimization; Banks (2026) subsequently quantified the spread, showing Claude Sonnet 4.5 with a sentiment delta 6.7x larger than GPT-5's. A single-model audit hides this variance and produces a measurement with unknown reliability. This is also why you should never let the same model generate the prompts and score the responses: Podsakoff et al. (2003) showed that using one instrument to measure both sides of a relationship inflates the apparent signal; a cleanly separated environment avoids that common-method artifact.
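To make "cross-model variance" concrete before we build anything: once your spreadsheet has rows, the core comparison is a per-model mention rate. Here is a minimal Python sketch of that calculation; the rows are made up for illustration, and only two of the eighteen schema columns are shown.

```python
from collections import defaultdict

# Made-up rows using two fields from the schema built in this lesson.
rows = [
    {"model": "chatgpt",    "brand_mentioned": "Y"},
    {"model": "chatgpt",    "brand_mentioned": "N"},
    {"model": "claude",     "brand_mentioned": "Y"},
    {"model": "claude",     "brand_mentioned": "Y"},
    {"model": "gemini",     "brand_mentioned": "N"},
    {"model": "perplexity", "brand_mentioned": "Y"},
]

counts = defaultdict(lambda: [0, 0])  # model -> [mentions, total runs]
for row in rows:
    counts[row["model"]][1] += 1
    if row["brand_mentioned"] == "Y":
        counts[row["model"]][0] += 1

for model, (mentions, total) in sorted(counts.items()):
    print(f"{model:<10} {mentions}/{total} mentioned ({mentions / total:.0%})")
```

If the four rates cluster tightly, the category is stable across labs; if they diverge, you have found the variance a single-model audit would have hidden.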
Part 2: Build your audit spreadsheet
Your spreadsheet is the foundation. Open Google Sheets (or Excel, if you prefer). Create a new file. Name it something like aeo-audit-<brand>-<YYYY-MM>.xlsx. Version it by month so your work stays traceable.
The column schema
Create a sheet called responses with these columns, in this order:
| Column | Header | What it holds |
|---|---|---|
| A | `date` | ISO format: 2026-04-19 |
| B | `session_id` | An identifier for the audit session, e.g., `audit-001-blind` |
| C | `model` | One of: `chatgpt`, `claude`, `gemini`, `perplexity` |
| D | `model_version` | The version string as shown in the UI (e.g., `gpt-5`, `claude-sonnet-4.5`) |
| E | `category` | The market category being probed (e.g., `fitness-wearables`) |
| F | `question_id` | Short ID from your prompt library (e.g., `Q-blind-01`) |
| G | `question_type` | One of: `blind`, `named`, `comparison`, `adversarial` |
| H | `question_text` | The full prompt, copied verbatim |
| I | `response_snippet` | The relevant 1-3 sentences from the response |
| J | `full_response_link` | Optional: link to the saved full response in a Docs file |
| K | `brand_mentioned` | Y/N |
| L | `position` | Position in the response (1 if first-mentioned, 2 if second, blank if not mentioned) |
| M | `competitors_mentioned` | Comma-separated list of other brands named |
| N | `sentiment` | positive / neutral / negative / mixed |
| O | `sentiment_rationale` | One phrase: the word or phrase that drove your sentiment call |
| P | `refusal_flag` | Y if the model refused or hedged out; blank otherwise |
| Q | `hallucination_flag` | Y if the model invented a product, feature, or fact; blank otherwise |
| R | `notes` | Freeform, anything worth remembering later |
This is your raw data layer. Every response becomes one row. Do not aggregate in this sheet; aggregation happens in the analysis sheet we build in Lesson 6.4.
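If you want a machine-readable copy of the schema alongside the Google Sheet, a small sketch like the one below writes the same 18 headers to a CSV file. The filename is illustrative; everything else mirrors the table above.

```python
import csv

# Columns A through R of the responses sheet, in order.
COLUMNS = [
    "date", "session_id", "model", "model_version", "category",
    "question_id", "question_type", "question_text", "response_snippet",
    "full_response_link", "brand_mentioned", "position",
    "competitors_mentioned", "sentiment", "sentiment_rationale",
    "refusal_flag", "hallucination_flag", "notes",
]

with open("aeo-audit-example-2026-04.csv", "w", newline="") as f:
    csv.DictWriter(f, fieldnames=COLUMNS).writeheader()
```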
Why this schema
Each column earns its place. `session_id` lets you group runs together. `model_version` lets you trace findings to a specific model snapshot, crucial when models update silently. `question_type` is what enables the sycophancy-gap calculation. `position` captures the "first-mentioned" advantage that downstream buyers experience as the de facto recommendation. `refusal_flag` and `hallucination_flag` let you separate clean data from edge cases. `notes` preserves the pattern-recognition work you'll otherwise forget. The multi-column design is deliberate: Peikos (2024) showed that relevance and visibility are irreducibly multidimensional; a single "visibility score" column collapses information that actually moves independently.
Miss a column now and you will regret it later. You will run the audit, start analyzing, and realize you didn't capture the position of competitor mentions, so you can't answer whether the brand was framed as a leader or a follower. Build the whole schema now. Fill columns you don't use. You can always ignore a column; you can't recover data you didn't capture.
Setting up validation
In Google Sheets, use Data → Data validation to lock these columns to preset values:
- `model`: dropdown with `chatgpt` / `claude` / `gemini` / `perplexity`
- `question_type`: dropdown with `blind` / `named` / `comparison` / `adversarial`
- `brand_mentioned`: dropdown with Y / N
- `sentiment`: dropdown with positive / neutral / negative / mixed
- `refusal_flag`, `hallucination_flag`: dropdown with Y / blank
Validation catches typos before they poison your analysis. The first time you try to filter for blind questions and half your rows say Blind or BLIND, you'll understand why.
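If you keep a CSV mirror of the sheet, the same dropdown rules translate directly into code. A minimal sketch, assuming the column names from this lesson's schema:

```python
# Allowed values, mirroring the Data validation dropdowns above.
ALLOWED = {
    "model": {"chatgpt", "claude", "gemini", "perplexity"},
    "question_type": {"blind", "named", "comparison", "adversarial"},
    "brand_mentioned": {"Y", "N"},
    "sentiment": {"positive", "neutral", "negative", "mixed"},
    "refusal_flag": {"Y", ""},
    "hallucination_flag": {"Y", ""},
}

def validate_row(row: dict) -> list[str]:
    """Return one error message per constrained column with a bad value."""
    return [
        f"{col}: {row.get(col, '')!r} not in {sorted(values)}"
        for col, values in ALLOWED.items()
        if row.get(col, "") not in values
    ]

# 'Claude' (capitalized) is exactly the typo validation exists to catch.
print(validate_row({"model": "Claude", "question_type": "blind",
                    "brand_mentioned": "Y", "sentiment": "positive",
                    "refusal_flag": "", "hallucination_flag": ""}))
```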
Part 3: Build your prompt library
Your prompt library lives in a separate file: a markdown file, a Notion page, or a second tab in your spreadsheet. It is the single source of truth for the questions you are running. Every question gets a stable ID so you can re-run the same audit next month and trust the comparison.
The library structure
For each category you audit, maintain these sections:
- Category context: one paragraph describing the market, the target brand, the 3-5 direct competitors, and the buyer audience you're probing for.
- Blind questions: questions that do NOT name the target brand. These measure organic visibility. See Blind vs. Named Measurement for the methodology.
- Named questions: questions that DO name the target brand. These trigger sycophancy and produce inflated mention rates. You run them deliberately to quantify the sycophancy gap.
- Comparison questions: pairwise prompts that force the model to compare the brand against a specific competitor.
- Adversarial questions: reputation-probing prompts ("What are the concerns with Brand X?"). Used sparingly and documented carefully.
Naming convention
Use stable IDs like Q-blind-01, Q-named-01, Q-comp-01. The prefix tells you the question type at a glance. The number is stable across audit sessions. When you re-run the same question set next month, the IDs match, and longitudinal comparison is trivial.
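The convention is strict enough to check mechanically. Here is a small sketch; the two-digit number and the `adv` prefix for adversarial questions are my assumptions, since the lesson spells out only `blind`, `named`, and `comp`:

```python
import re

# Q-<type>-<two-digit number>, e.g. Q-blind-01 (adv prefix is assumed).
ID_PATTERN = re.compile(r"^Q-(blind|named|comp|adv)-\d{2}$")

for qid in ["Q-blind-01", "Q-comp-12", "Q-blind-1", "q-named-01"]:
    print(qid, "->", "ok" if ID_PATTERN.match(qid) else "BAD ID")
```

Run it over your library before each session and longitudinal joins on `question_id` stay clean.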
We design the actual question set in Lesson 6.2. For now, just set up the file structure.
AEO claim, why blind vs. named matters: In a controlled 2026 experiment with 864 paired observations across four AI models, named prompts produced a 22.5 percentage point mention-rate inflation over blind prompts (Banks, 2026, blind vs named measurement). Tools that only ask named questions are measuring something closer to prompt-compliance than genuine brand visibility.
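The gap itself is simple arithmetic once the rows are logged. A worked example with made-up counts (the 22.5-point figure above is Banks's, not this one):

```python
# Mention rates from hypothetical blind and named runs of the same questions.
blind_mentions, blind_total = 9, 40    # brand appeared in 9 of 40 blind runs
named_mentions, named_total = 21, 40   # brand appeared in 21 of 40 named runs

blind_rate = blind_mentions / blind_total     # 0.225 -> 22.5%
named_rate = named_mentions / named_total     # 0.525 -> 52.5%
gap = (named_rate - blind_rate) * 100         # 30.0 percentage points

print(f"blind {blind_rate:.1%}, named {named_rate:.1%}, "
      f"sycophancy gap {gap:.1f} points")
```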
Part 4: Your research notebook
Separate from the spreadsheet, you need a notebook. This is where you write narrative, patterns you're noticing, questions that emerge mid-audit, decisions about how to handle edge cases.
What to use
- Notion, Obsidian, or a markdown file all work equally well.
- Avoid Google Docs for this; it is too unstructured for longitudinal work.
- Name it `aeo-notebook-<brand>.md`. One notebook per target brand. One notebook can span many audit sessions.
What to log in the notebook, not the spreadsheet
- Session-level observations ("Today Claude was refusing to recommend any brand; something seems to have changed in the safety layer.")
- Decisions you made about edge cases ("When the model hedged with 'I can't recommend a specific brand,' I logged it as `refusal_flag: Y` and did not record a mention.")
- Hypotheses you want to test in a future session
- Screenshots of unusual responses (link out to an image file or embed in Notion)
The spreadsheet is structured data. The notebook is unstructured reasoning. You need both. Analysts who skip the notebook produce spreadsheets they can filter but can't interpret six months later.
Part 5: Logging conventions
The single most important discipline in an AEO audit is logging in real time. Not after. Not in batches. Right as the response appears on your screen.
The five-second rule
Within five seconds of reading a model's response, log the following in your spreadsheet:
- `brand_mentioned` (Y/N, the binary is the easy part)
- `position` (if mentioned, the number of the mention)
- `sentiment` (gut call: positive / neutral / negative / mixed)
- `response_snippet` (copy-paste the relevant 1-3 sentences)
If you wait until you've run five questions before logging, two things happen: you forget the details of the first three, and your sentiment calls drift toward the last response you saw. Log as you go.
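If you run audits at a terminal, a tiny append-only helper enforces the habit. This is a sketch, not the full schema: it logs only the five-second-rule fields plus the ISO date, and the filename is illustrative.

```python
import csv
from datetime import date

LOG_FILE = "aeo-audit-example-2026-04.csv"  # illustrative filename

def log_response(session_id: str, model: str, question_id: str,
                 snippet: str, brand_mentioned: str,
                 position: str = "", sentiment: str = "neutral") -> None:
    """Append one trimmed row the moment you finish reading a response."""
    with open(LOG_FILE, "a", newline="") as f:
        csv.writer(f).writerow([
            date.today().isoformat(),  # ISO date, as in the date column
            session_id, model, question_id, snippet,
            brand_mentioned, position, sentiment,
        ])

log_response("audit-001-blind", "claude", "Q-blind-01",
             "Top picks include Brand X and Brand Y.", "Y",
             position="1", sentiment="neutral")
```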
Copy verbatim, not summarized
Your `response_snippet` column should contain the model's actual words. Not your paraphrase. Not a summary. The verbatim text. This matters for two reasons. First, if you need to revisit the data later, paraphrases hide signal. Second, verbatim text is what you cite in the audit report; it is the evidence.
Sentiment is a judgment call, document the judgment
Sentiment is the most subjective field in your schema. Reasonable people disagree. That's why you have the `sentiment_rationale` column. For every sentiment call, write one phrase: the word or phrase that drove the call. Example: positive — "industry leader", mixed — "innovative but limited", negative — "expensive and proprietary".
Two benefits. First, your call becomes reproducible: another analyst can see your reasoning. Second, when you aggregate sentiment later, you can audit your own consistency. If three neutral rows all had "limited" as the rationale, you may have been miscoding.
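That self-audit is a two-line aggregation once the rows exist. A sketch with made-up rows, pairing each sentiment call with its rationale so repeated rationales surface:

```python
from collections import Counter

# Made-up rows using two schema fields from this lesson.
rows = [
    {"sentiment": "neutral",  "sentiment_rationale": "limited"},
    {"sentiment": "neutral",  "sentiment_rationale": "limited"},
    {"sentiment": "mixed",    "sentiment_rationale": "innovative but limited"},
    {"sentiment": "positive", "sentiment_rationale": "industry leader"},
]

pairs = Counter((r["sentiment"], r["sentiment_rationale"]) for r in rows)
for (sentiment, rationale), n in pairs.most_common():
    print(f"{n}x {sentiment:<9} rationale: {rationale!r}")
```

Two neutral rows sharing the rationale "limited" is exactly the pattern that should make you re-check whether those rows were really neutral or quietly mixed.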
Timestamps matter
Log the session date, not just the week. AI models change silently. A response pattern you observed on April 3 may be gone by April 17: the model was updated, the RAG layer was tuned, or the safety guardrails were tightened. You cannot diagnose these changes without timestamps.
Part 6: The one-hour setup checklist
Here is the entire setup, as a checklist. Work through it in order. Budget one hour. Your environment will be complete and ready for the question-set design in Lesson 6.2.
- Accounts (15 minutes)
  - [ ] ChatGPT account created and logged in
  - [ ] Claude account created and logged in
  - [ ] Gemini account created and logged in
  - [ ] Perplexity account created and logged in
  - [ ] All four can start a new chat and copy a response cleanly
- Spreadsheet (20 minutes)
  - [ ] New file created, named `aeo-audit-<brand>-<YYYY-MM>.xlsx`
  - [ ] `responses` sheet has all 18 columns (A through R)
  - [ ] Data validation applied to `model`, `question_type`, `brand_mentioned`, `sentiment`, `refusal_flag`, `hallucination_flag`
  - [ ] One test row entered and deleted to confirm validation works
- Prompt library (15 minutes)
  - [ ] File created: `aeo-prompt-library-<brand>.md` (or Notion page)
  - [ ] Category context section drafted (one paragraph)
  - [ ] Section headers created for blind / named / comparison / adversarial (questions added in 6.2)
  - [ ] Naming convention agreed on: `Q-<type>-<number>`
- Research notebook (10 minutes)
  - [ ] File created: `aeo-notebook-<brand>.md` (or Notion page)
  - [ ] First session entry stub created, dated today
  - [ ] Understood what goes in the notebook vs. the spreadsheet
AEO claim, why log verbatim responses: AI model outputs for the same prompt vary significantly over time; Fishkin (2026) found cited-source repeatability under 1% across 2,961 paired AI-model prompts. Verbatim response snippets with timestamps are the only way to reconstruct what a model said at a specific point in time; paraphrases lose the evidentiary signal.
Exercise, your setup deliverable
Work through the one-hour checklist above. At the end:
- Screenshot your spreadsheet with all 18 columns visible.
- Paste the category context paragraph from your prompt library into your notebook.
- In your notebook, write one paragraph: why did you choose the brand you chose? What do you expect the audit to reveal?
Save these. You'll reference them in Lesson 6.5 when you write the audit report. The "expectations paragraph" is especially useful, comparing what you expected to what you found is one of the most instructive moments in any audit.
Handling edge cases you'll encounter in setup
"The Claude account asks for a phone number"
That's normal. Use your actual phone number; Anthropic uses it for abuse prevention, not marketing.
"My employer's firewall blocks Perplexity"
Audit from a personal device on your home network. Many enterprise firewalls still classify Perplexity as uncategorized. This is a temporary friction, not a blocker.
"Should I pay for the Plus/Pro tiers?"
For your first audit: no. Free tiers are sufficient to see the methodology work. Upgrade when you hit a specific friction, such as message limits or a model version you need. Pay for capability, not for status.
"Can I skip Gemini or Perplexity?"
You can, but you shouldn't. Cross-model variance is the single most diagnostic signal in AEO measurement. A four-model audit reveals patterns a two-model audit cannot. Three is the minimum. Four is the baseline.
Takeaways
- The environment is the audit. The tools matter less than the logging discipline. A free-tier account with a well-structured spreadsheet beats a paid stack with ad-hoc notes.
- Four models, not one. Cross-model variance is the most diagnostic signal you can measure. Claude, ChatGPT, Gemini, and Perplexity span the frontier labs and expose the variance that single-model audits hide.
- Log verbatim, in real time, with rationale. Paraphrases hide signal. Batch logging drifts. Undocumented sentiment calls are unreproducible. Five-second rule, every response.
What's next
Now that your environment is set up, Lesson 6.2, Designing Your Question Set, walks you through building the prompt library for a specific brand and category. You will leave Lesson 6.2 with a complete, named question set ready to execute. Lesson 6.3 runs the audit. Lesson 6.4 does the math. Lesson 6.5 writes the report.
Reflection prompt
Before you move on: look at your spreadsheet. Imagine it is six months from now and you have 300 rows of data. Which column do you think you will use most in analysis? Which column might you have skipped in a rush that you're now grateful you kept? Write a sentence in your notebook. You're starting to think like an AEO Strategist.
Templates referenced: Audit Spreadsheet (template forthcoming). The schema in this lesson is the authoritative reference until the Google Sheets template ships.
About this course
This lesson is part of AEO A to Z, the open course on Answer Engine Optimization published by GenPicked Academy. GenPicked Academy is where practitioners learn to measure AI recommendations with the same rigor a clinical trial demands: blind sampling, balanced question sets, and confidence intervals that hold up.
About the author: Dr. William L. Banks III is the lead researcher at GenPicked Academy and the architect of the three-layer AEO measurement architecture taught in this course. His work on sycophancy, popularity bias, and construct validity in AI search informs every lesson you just read.
See the methods in practice: GenPicked runs monthly brand-intelligence audits using the exact pipeline taught in Module 6. Read the case studies and audit walkthroughs on the GenPicked blog.