Your client's Q2 business review is in three weeks. The VP of Marketing is going to ask: "Are we showing up in ChatGPT? Perplexity? Google's AI thing?" And your agency doesn't have a structured answer. You've run a few manual checks. Some clients look okay. Others are invisible. But you don't have a framework to grade them, compare them to competitors, or present the work with confidence.
This is the grading checklist. It's designed for agencies managing 5-50 client brands across a portfolio. Each audit dimension has a clear A-F rubric, a specific measurement method, and a priority-ranked fix list. Run this on your worst-performing client this week. You'll find at least 3 dimensions where a small structural change will move the needle within 30 days. Use that win in the next quarterly business review. The 12 dimensions together form a composite AEO Citation Score (ACS) that you can track month-over-month and report to clients as a concrete indicator of their AI visibility progress.
Start your 14-day free trial
Growth plan free for 14 days. Five AI engines. Full agency dashboard.
Start free trial

Why a structured audit matters right now (3 hero stats)
Three numbers that should sit in your quarterly business review deck with your client:
Your client's buyer opens ChatGPT or Perplexity in week one of their research. If your client is not on the shortlist that AI generates, the deal is lost before the first sales email. A structured audit tells you where the visibility gaps are and which fixes will move the needle in 30 days instead of 90. This is why quarterly audits are becoming table-stakes for agencies. Without them, you're making recommendations in the dark.
The 12-dimensional AEO audit framework
Dimension 1: AI Citation Score & Share of Voice
Test 50–100 natural queries your client's prospects would ask. Record how often your client's brand is mentioned in AI responses across all 5 engines (ChatGPT, Perplexity, Gemini, Claude, Google AI Overviews). Compare to visibility in traditional Google Search. This is your baseline metric—the percentage of AI responses that cite your client.
Fix Priority: 1. Identify which engines have zero presence (biggest gap first). 2. Audit top 20 competitor responses on same queries to identify source patterns. 3. Map content gaps—content your competitors have but your client doesn't. Evidence: Conductor (2026), Profound API (2026)
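If you want to systematize those manual checks, a minimal sketch of the baseline math looks like this. The tuple shape and engine labels are my own convention, not any tool's API:

```python
# Per-engine citation rate from manual query tests: one row per
# (engine, query) check, recording whether the brand was mentioned.
from collections import defaultdict

def citation_rates(results):
    """results: iterable of (engine, query, brand_mentioned) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for engine, _query, mentioned in results:
        totals[engine] += 1
        hits[engine] += int(mentioned)
    # Percentage of tested queries where the brand was cited, per engine.
    return {e: round(100 * hits[e] / totals[e], 1) for e in totals}

tests = [
    ("chatgpt", "best crm for agencies", True),
    ("chatgpt", "crm with white-label reporting", False),
    ("perplexity", "best crm for agencies", False),
    ("perplexity", "crm with white-label reporting", False),
]
print(citation_rates(tests))  # {'chatgpt': 50.0, 'perplexity': 0.0}
```

Keeping the rate per engine (rather than one blended number) also sets up Dimension 2's engine-by-engine comparison.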
Dimension 2: Engine-Specific Brand-Mention Rates
Test 50–100 queries in each engine separately. Benchmark against per-engine baselines: Claude cites brands at 97.3%, ChatGPT at 73.6%, Perplexity at 54.8% per Profound analysis. Single-engine audits hide the real picture—the same brand can be #1 on Claude and invisible on ChatGPT. This dimension is critical because it surfaces engine preference bias. Some engines favor Wikipedia and brand sites. Others favor Reddit and community content. Understanding where your client is weak per engine tells you exactly which content and authority-building strategy to prioritize.
A composite "AI visibility score" averaged across engines is worthless. Always split by engine. Your reporting to clients should show ChatGPT separately from Perplexity, not as one blended number. Engine-specific strategy is non-negotiable because each engine's source bias is different.
Fix Priority: 1. Grade lowest-performing engine first (biggest ROI on effort). 2. Test 25 competitor queries in that engine to identify source-citation patterns. 3. Build engine-specific content and schema strategy. Evidence: Profound Citation Analysis (2026), SE Ranking (2026)
Dimension 3: Engine Coverage Gap & Invisibility Map
Map which of the 5 engines have zero or near-zero presence. The average agency client shows presence in only 3.3 of 5 engines, leaving a 34% visibility blindspot. Test your client across all 5 engines on the same 10 queries. Track which engines return zero mentions. Grading: A=all 5 engines (0% gap), B=4 of 5 (20% gap), C=3 of 5 (40% gap), D=2 of 5 (60% gap), F=1 or fewer (80%+ gap).
The implications are concrete: if your client is invisible on Perplexity (15–20M US searches/month) and weak on Claude (45M+ searches/month), they're missing a combined 60M+ monthly search opportunities. That's your sales pitch to the client—not just "your visibility is low," but "you're missing 60M monthly opportunities across two engines."
Fix Priority: 1. Audit coverage gap (which engine is worst?). 2. Test 5 high-intent competitor queries in gap engine to reverse-engineer source preferences. 3. Develop engine-specific strategy. Evidence: GenPicked Engine Traffic Estimates (2026)
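The rubric above is mechanical enough to script. A sketch, assuming you record how many of the 5 engines show any presence for the client:

```python
def coverage_grade(engines_present: int) -> tuple[str, int]:
    """Map the count of engines (out of 5) with any presence to the
    rubric grade and the visibility-gap percentage."""
    grade = {5: "A", 4: "B", 3: "C", 2: "D"}.get(engines_present, "F")
    gap_pct = (5 - engines_present) * 20  # each missing engine = 20% gap
    return grade, gap_pct

print(coverage_grade(3))  # ('C', 40)
```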
Dimension 4: Reddit Citation Footprint
Measure what percentage of your client's citations come from Reddit across all 5 engines. Critical for Perplexity: 46.7% of Perplexity's top citations are Reddit posts per Discovered Labs. This dimension matters because Reddit is where community conversations happen—and AI engines are trained to pull from those conversations. Grading: A=15–25% (healthy Reddit presence, not over-dependent), B=25–35% (moderate), C=35–50% (heavy), D=50%+ (dangerously dependent on Reddit volatility), F=0% (missing Perplexity entirely).
Fix Priority: 1. Audit Reddit mention rate in relevant communities. 2. If Perplexity gap is high, build modest Reddit engagement strategy (not viral—just quality community participation). 3. Monitor Reddit sentiment drift quarterly. Evidence: Discovered Labs Perplexity Study (2026), Growth Marshal (2026)
Dimension 5: Schema Markup Audit for AI Engines
Audit whether product/review/FAQ/organization pages use attribute-rich structured data. Test whether schema appears in Google AI Overview responses. Grading: A=90–100% of eligible pages have schema visible in Overviews, B=75–89%, C=60–74%, D=40–59%, F=below 40%. Run pages through Google's Rich Results Test and test sample pages in Google AI Overview preview.
Important caveat: Generic schema performs worse than no schema—only attribute-rich schema (Product, Review, FAQ with pricing/ratings/specs) moves the needle per BrightEdge. A page with vague Organization schema that ranks #1 gets fewer citations than one with Product schema with detailed attributes that ranks #3. Quality of schema matters more than its presence.
Fix Priority: 1. Inventory all product/service/review/FAQ pages. 2. Prioritize top 20 revenue-generating or lead-generating pages. 3. Add or fix schema markup on highest-priority pages. 4. Test in Google AI Overview preview within 2 weeks. Evidence: BrightEdge AI Overview Analysis (2026), Schema.org (2026)
Dimension 6: Domain Authority & Earned Brand-Mention Volume
Combine two metrics: (1) Domain Authority (Ahrefs/Moz), (2) monthly volume of earned brand mentions across trusted sources (news outlets, industry publications, academic journals, authority blogs). Grading: A=DA 50+, 100+ mentions/month from authority sources; B=DA 40–49, 50–100 mentions; C=DA 30–39, 20–50 mentions; D=DA 20–29, below 20 mentions; F=DA below 20.
Brand mentions correlate 0.664 with AI visibility vs 0.218 for backlinks per RivalHound—a 3:1 advantage for mentions. This is the single largest effect driver in the entire audit framework. If you have bandwidth for one lever, this is it. Domain authority acts as a trustworthiness multiplier in AI responses.
Fix Priority: 1. Run current DA and earned-mention audit (use Conductor or Ahrefs). 2. Identify 5–10 trusted authority outlets your buyers read. 3. Develop PR/earned media strategy targeting those outlets. 4. Monitor monthly earned mention volume. Evidence: Ahrefs DA & AI Visibility (2026), RivalHound Earned Mentions Study (2026)
Dimension 7: Content Chunking & AI Citation Hygiene
Measure the average "citation chunk" size (word count of content snippets AI engines pull). Optimal: 50–150 words. Audit structure: headers, lists, paragraphs vs dense walls of text. Grading: A=80%+ of pages in optimal range, B=60–79%, C=40–59%, D=20–39%, F=below 20%. Test pages manually by querying ChatGPT and measuring which sections it cites.
Content chunks of 50–150 words are cited by AI at 3.2× the rate of longer passages per Seer Interactive. Shorter is not always better; denser is not smarter. Optimal extractability wins. Use this as a content rewrite trigger for your lowest-graded pages.
Fix Priority: 1. Audit top 10 client pages; measure citation chunk sizes. 2. Rewrite pages with sub-optimal chunking. 3. Add strategic headers/lists to break up long-form content. 4. Re-test in AI engines within 4 weeks. Evidence: Seer Interactive Content Structure Study (2026), Frase AI Optimization Guide (2026)
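A rough way to pre-screen pages before the manual AI testing in step 1: split on blank lines as a stand-in for the chunks an engine might lift. That split is an approximation, not how any engine actually segments content:

```python
import re

def chunk_audit(page_text: str):
    """Report each blank-line-delimited chunk's word count and the
    share of chunks landing in the 50-150 word sweet spot."""
    chunks = [c for c in re.split(r"\n\s*\n", page_text) if c.strip()]
    sizes = [len(c.split()) for c in chunks]
    in_range = sum(1 for s in sizes if 50 <= s <= 150)
    pct_optimal = round(100 * in_range / len(sizes), 1) if sizes else 0.0
    return sizes, pct_optimal
```

Pages scoring well below 80% optimal chunks are your rewrite candidates; confirm with a manual ChatGPT query before investing the rewrite time.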
Dimension 8: GA4 Attribution & AI Traffic Misclassification
Audit GA4 traffic classification. 60%+ of ChatGPT traffic is misclassified as "Direct" in GA4 per SE Journal. Grading: A=0–10% gap (GA4 correctly tracks 90%+ AI traffic), B=10–25%, C=25–50%, D=50–75%, F=75%+ gap (GA4 captures below 25%).
This creates a false narrative for client reporting: strong AI citation rates but no visible traffic lift in GA4. Agencies under-invest in AEO as a result. Start by building a custom GA4 segment for suspected AI traffic and monitoring Direct traffic spikes in weeks when you know AI citations are high.
Fix Priority: 1. Create custom GA4 segment for AI traffic. 2. Compare Direct traffic spikes to citation increases (manual check). 3. Set up referrer-based detection for known AI domains. 4. Quantify citation-vs-GA4 gap and report it to the client. Evidence: SE Journal GA4 Attribution Analysis (2026)
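Step 3's referrer-based detection can be prototyped in a few lines. The host list below is an assumption based on commonly reported AI referrers; verify it against your own referral reports before using it in client dashboards:

```python
# Assumed referrer hosts for major AI assistants -- check your own
# GA4 referral data, since these values can change.
AI_REFERRER_HOSTS = (
    "chatgpt.com", "chat.openai.com", "perplexity.ai",
    "gemini.google.com", "claude.ai",
)

def is_ai_referral(referrer: str) -> bool:
    """True if the session's referrer matches a known AI host."""
    r = (referrer or "").lower()
    return any(host in r for host in AI_REFERRER_HOSTS)

print(is_ai_referral("https://www.perplexity.ai/search?q=crm"))  # True
print(is_ai_referral(""))  # False -- an empty referrer lands in "Direct"
```

The empty-referrer case is exactly the misclassification problem: much AI traffic arrives with no referrer at all, which is why the citation-vs-GA4 gap in step 4 matters.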
Dimension 9: Google AI Overviews Coverage & Tracked Query Visibility
Run 50–100 target keywords. For each, check if a Google AI Overview appears. If yes, does it mention the client? Calculate "Overview Appearance Rate" and "Citation Rate Within Overview." Grading: A=80%+ appearance, 70%+ citation; B=60–79% appearance; C=40–59%; D=20–39%; F=below 20%.
Google AI Overviews appear on 50–60% of queries per Conductor and are growing. Citations in Overviews are relatively stable compared to ChatGPT/Perplexity volatility, making this dimension valuable for long-term strategy.
Fix Priority: 1. Audit current Overview appearance and citation rates. 2. Identify query clusters with low citation. 3. Optimize schema markup for Overview-heavy query types. 4. Improve content structure (headers, lists, definitions). Evidence: Conductor AI Overview Metrics (2026), Google Search Central (2026)
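The two rates are simple to compute once you've logged the manual checks. The list-of-boolean-pairs shape is my own convention:

```python
def overview_rates(checks):
    """checks: list of (overview_shown, client_cited) pairs per keyword.
    Returns (Overview Appearance Rate %, Citation Rate Within Overview %)."""
    if not checks:
        return 0.0, 0.0
    shown = [cited for has_overview, cited in checks if has_overview]
    appearance = round(100 * len(shown) / len(checks), 1)
    citation = round(100 * sum(shown) / len(shown), 1) if shown else 0.0
    return appearance, citation

# 3 of 4 keywords trigger an Overview; the client is cited in 2 of those 3.
print(overview_rates([(True, True), (True, True), (True, False), (False, False)]))
# (75.0, 66.7)
```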
Dimension 10: Sentiment & Perception Drift
Analyze how AI engines describe the client's brand in responses. Extract frequently paired adjectives ("premium," "affordable," "innovative"). Compare brand descriptors across engines and against competitors. Grading: A=consistent, positive descriptors, 80%+ alignment across engines; B=60–79% alignment; C=40–59%; D=below 40%; F=negative or irrelevant descriptors.
Fix Priority: 1. Conduct descriptor audit across all 5 engines. 2. Identify negative or misaligned descriptors. 3. Trace descriptor sources to identify where perception gap originated. 4. Build strategy to boost positive descriptor sources. Evidence: ZipTie Brand Perception Study (2026)
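The framework doesn't prescribe how to score "alignment." One reasonable option, sketched here, is mean pairwise Jaccard overlap of the descriptor sets each engine uses:

```python
from itertools import combinations

def descriptor_alignment(descriptors: dict) -> float:
    """Mean pairwise Jaccard overlap (as a %) of the adjective sets.
    descriptors maps engine name -> set of descriptor words."""
    pairs = list(combinations(descriptors.values(), 2))
    if not pairs:
        return 100.0  # a single engine can't disagree with itself
    scores = [100 * len(a & b) / len(a | b) for a, b in pairs if a | b]
    return round(sum(scores) / len(scores), 1) if scores else 0.0

engines = {
    "chatgpt": {"premium", "innovative", "enterprise"},
    "perplexity": {"premium", "affordable"},
}
print(descriptor_alignment(engines))  # 25.0 -- only "premium" overlaps
```

Any overlap metric works as long as you apply it consistently month-over-month; the trend matters more than the absolute number.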
Dimension 11: Competitor Citation Overlap & Query Loss Analysis
41% of queries where you rank but don't get cited are won by 2–3 specific competitors per RivalHound. Audit 30–50 high-intent keywords. Map which competitors are cited when your client is absent. Grading: A=below 10% overlap loss; B=10–29%; C=30–39%; D=40–59%; F=60%+.
This dimension reveals the exact competitive gap. If the same 2–3 competitors consistently beat you in AI responses across multiple queries, reverse-engineering their content, schema, authority signals, and positioning tells you the exact gaps to close.
Fix Priority: 1. Identify top 3 citation-winning competitors. 2. Reverse-engineer #1 winner (compare content length, structure, schema richness, domain authority, earned mentions). 3. Close largest gap. 4. Re-test in AI engines in 6–8 weeks. Evidence: RivalHound Competitive Loss Analysis (2026)
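A small helper makes step 1 concrete: log each audited query with whether the client was cited and, if not, which competitor won (the data shape is assumed):

```python
from collections import Counter

def citation_winners(query_results):
    """query_results: list of (client_cited, winning_competitor_or_None).
    Returns the overlap-loss % and the top 3 competitors taking citations."""
    losses = [w for cited, w in query_results if not cited and w]
    loss_pct = round(100 * len(losses) / len(query_results), 1) if query_results else 0.0
    return loss_pct, Counter(losses).most_common(3)

audit = [(True, None), (False, "acme"), (False, "acme"), (False, "beta")]
print(citation_winners(audit))  # (75.0, [('acme', 2), ('beta', 1)])
```

The first name in the ranked list is your reverse-engineering target for step 2.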
Dimension 12: Alert Hygiene & Monitoring Configuration
Set up alerts for changes impacting AI visibility: brand mentions, competitor content launches, schema errors, ranking shifts, citation rate changes. Grading: A=6+ alert types configured, weekly reviews, <48hr response SLA; B=4–5 alert types, bi-weekly reviews, 3-day SLA; C=3–4 alert types, monthly reviews; D=1–2 alert types, ad-hoc; F=no alerts.
AEO is volatile. A brand can shift from 80% citation rate to 40% in 2–4 weeks if key content is removed or competitor content goes viral. Monthly audits are too slow. Weekly snapshot testing + alert-driven response is the minimum operational cadence.
Fix Priority: 1. Select 2–3 highest-impact alert types (brand mentions, citation rate drops, competitor new content). 2. Configure tools and set review cadence. 3. Document alert escalation protocol. 4. Add remaining alert types iteratively. Evidence: SE Ranking Monitoring Best Practices (2026)
How to grade, composite, and report the audit
Run all 12 dimensions on each client. Assign each an A-F grade. Weight them by measured impact: Dimensions 1, 6, and 11 carry a raw weight of 15 each (they drive the most citations), Dimensions 2–5 and 7–10 carry 10 each, and Dimension 12 carries 5; divide each by the raw total of 130 so the weights sum to 100%. Calculate a composite score on a 0–100 scale where A=90–100, B=80–89, C=70–79, D=60–69, F=below 60.
A composite above 75 is strong for competitive markets. Below 50 means urgent work needed before the next QBR. Track this score month-over-month and use it as your primary client-reporting metric.
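Here's the composite math as a sketch. The grade-to-points midpoints (A=95 through F=50) are an assumed convention, and since the raw weights (15, 10, 5) total 130, the code normalizes them by their sum so they behave as percentages:

```python
# Assumed numeric midpoints for letter grades on the 0-100 scale.
GRADE_POINTS = {"A": 95, "B": 85, "C": 75, "D": 65, "F": 50}

# Raw impact weights: heavy (dims 1, 6, 11), standard (2-5, 7-10), light (12).
RAW_WEIGHTS = {d: 15 for d in (1, 6, 11)}
RAW_WEIGHTS.update({d: 10 for d in (2, 3, 4, 5, 7, 8, 9, 10)})
RAW_WEIGHTS[12] = 5

def composite_acs(grades: dict) -> float:
    """grades: dimension number (1-12) -> letter grade A-F."""
    total = sum(RAW_WEIGHTS.values())  # 130; normalizing keeps the scale 0-100
    return round(sum(GRADE_POINTS[grades[d]] * w / total
                     for d, w in RAW_WEIGHTS.items()), 1)

all_b = {d: "B" for d in range(1, 13)}
print(composite_acs(all_b))  # 85.0
```

Because the heavy dimensions carry 1.5x the standard weight, pulling Dimension 6 from F to B moves the composite roughly four points: a useful fact when you're deciding which fixes to pitch first.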
Report this to the client in business terms, not audit jargon. Instead of "Your Domain Authority is 38 and you have 22 earned mentions per month," say: "Your brand is recommended in 6 of 10 AI responses on high-intent queries. Your closest competitor appears in 8 of 10. We found 5 authority publications that cite your competitor but not you—fixing that is our #1 Q3 initiative because each of those earned mentions typically drives 15–25 qualified leads per month."
Draft your first audit report for your worst-performing client this week. Use the 12 dimensions. Focus your first round of fixes on the 3 dimensions with the biggest gaps. Those are your Q3 initiatives. The client will see measurable progress in 4–6 weeks if you prioritize the highest-effect levers (earned mentions, engine coverage, schema). Use that progress in the next quarterly business review to justify continued or expanded investment.
What doesn't belong in the audit (the vendor trap)
llms.txt does not work. SE Ranking tested 300,000 domains and found zero correlation between llms.txt and AI citations. The file is a red herring. If a vendor pitches llms.txt as their #1 lever, ask what else they have.
Single-engine dashboards hide the real picture. A blended "AI visibility score" averaged across all engines tells you nothing actionable. Claude cites brands at 97.3%, ChatGPT at 73.6%—the spread means an averaged number is useless for strategy. Always split results by engine.
Generic schema is worse than no schema. Pages with generic schema were cited at 41.6%, vs 59.8% for no schema, and 61.7% for attribute-rich schema per Growth Marshal. Invest in specific, detailed schema only. Copy-pasting boilerplate JSON-LD is a waste of time.