The Agency AEO Audit Checklist: What to Grade on Every Client's AI Visibility

Your client's Q2 business review is in three weeks. The VP of Marketing is going to ask: "Are we showing up in ChatGPT? Perplexity? Google's AI thing?" And your agency doesn't have a structured answer. You've run a few manual checks. Some clients look okay. Others are invisible. But you don't have a framework to grade them, compare them to competitors, or present the work with confidence.

This is the grading checklist. It's designed for agencies managing 5-50 client brands across a portfolio. Each audit dimension has a clear A-F rubric, a specific measurement method, and a priority-ranked fix list. Run this on your worst-performing client this week. You'll find at least 3 dimensions where a small structural change will move the needle within 30 days. Use that win in the next quarterly business review. The 12 dimensions together form a composite AEO Citation Score (ACS) that you can track month-over-month and report to clients as a concrete indicator of their AI visibility progress.

Start your 14-day free trial

Growth plan free for 14 days. Five AI engines. Full agency dashboard.

Start free trial

Why a structured audit matters right now (3 hero stats)

Three numbers that should sit in your quarterly business review deck with your client:

94% of B2B buyers use AI during purchase per 6sense (2025)
87.4% of AI referral traffic comes from ChatGPT per Conductor (2026)
77% of brands are invisible to AI per Loamly (2026)

Your client's buyer opens ChatGPT or Perplexity in week one of their research. If your client is not on the shortlist that AI generates, the deal is lost before the first sales email. A structured audit tells you where the visibility gaps are and which fixes will move the needle in 30 days instead of 90. This is why quarterly audits are becoming table-stakes for agencies. Without them, you're making recommendations in the dark.

The 12-dimensional AEO audit framework

Dimension 1: AI Citation Score & Share of Voice

Test 50–100 natural queries your client's prospects would ask. Record how often your client's brand is mentioned in AI responses across all 5 engines (ChatGPT, Perplexity, Gemini, Claude, Google AI Overviews). Compare to visibility in traditional Google Search. This is your baseline metric—the percentage of AI responses that cite your client.

A (90–100%): Cited in 45+ AI queries; citation frequency matches search rank
B (75–89%): Cited in 30–44 queries; rate within 1–2 positions of search rank
C (60–74%): Cited in 20–29 queries; 3–5 positions below search rank
D (below 60%): Cited in fewer than 20 queries; invisible in 2+ engines
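
If you log the query tests in a spreadsheet or script, the tally is one function away. Here's a minimal sketch, assuming one record per query per engine; the field names are illustrative and the grade thresholds follow the percentage bands in the rubric above, not any specific tool.

```python
# Hypothetical sketch: tally Dimension 1 from manual query-test results.
def dimension_1(results):
    """results: list of dicts like
    {"query": "...", "engine": "chatgpt", "client_mentioned": True}"""
    queries = {r["query"] for r in results}
    cited = {r["query"] for r in results if r["client_mentioned"]}
    rate = len(cited) / len(queries) if queries else 0.0
    # Percentage bands from the rubric above.
    grade = "A" if rate >= 0.90 else "B" if rate >= 0.75 else "C" if rate >= 0.60 else "D"
    return round(rate * 100, 1), grade

print(dimension_1([
    {"query": "best crm for agencies", "engine": "chatgpt", "client_mentioned": True},
    {"query": "best crm for agencies", "engine": "perplexity", "client_mentioned": False},
    {"query": "top agency reporting tools", "engine": "chatgpt", "client_mentioned": False},
]))  # -> (50.0, 'D')
```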

Fix Priority: 1. Identify which engines have zero presence (biggest gap first). 2. Audit top 20 competitor responses on same queries to identify source patterns. 3. Map content gaps—content your competitors have but your client doesn't. Evidence: Conductor (2026), Profound API (2026)

Dimension 2: Engine-Specific Brand-Mention Rates

Test 50–100 queries in each engine separately. Benchmark against per-engine baselines: Claude cites brands at 97.3%, ChatGPT at 73.6%, Perplexity at 54.8% per Profound analysis. Single-engine audits hide the real picture—the same brand can be #1 on Claude and invisible on ChatGPT. This dimension is critical because it surfaces engine preference bias. Some engines favor Wikipedia and brand sites. Others favor Reddit and community content. Understanding where your client is weak per engine tells you exactly which content and authority-building strategy to prioritize.

Key insight

A composite "AI visibility score" averaged across engines is worthless. Always split by engine. Your reporting to clients should show ChatGPT separately from Perplexity, not as one blended number. Engine-specific strategy is non-negotiable because each engine's source bias is different.

Fix Priority: Grade lowest-performing engine first (biggest ROI on effort). Test 25 competitor queries in that engine to identify source-citation patterns. Build engine-specific content and schema strategy. Evidence: Profound Citation Analysis (2026), SE Ranking (2026)

Dimension 3: Engine Coverage Gap & Invisibility Map

Map which of the 5 engines have zero or near-zero presence. The average agency client shows presence in only 3.3 of 5 engines, leaving a 34% visibility blindspot. Test your client across all 5 engines on the same 10 queries. Track which engines return zero mentions. Grading: A=all 5 engines (0% gap), B=4 of 5 (20% gap), C=3 of 5 (40% gap), D=2 of 5 (60% gap), F=1 or fewer (80%+ gap).
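
The arithmetic is simple enough to script. A minimal sketch, assuming you record presence per engine as a true/false check; the engine keys are illustrative.

```python
# Hypothetical sketch: engine coverage gap and grade from a presence check.
ENGINES = ["chatgpt", "perplexity", "gemini", "claude", "google_ai_overviews"]

def coverage_gap(presence):
    """presence: dict like {"chatgpt": True, "perplexity": False, ...}"""
    covered = sum(1 for e in ENGINES if presence.get(e))
    gap_pct = (len(ENGINES) - covered) / len(ENGINES) * 100
    grade = {5: "A", 4: "B", 3: "C", 2: "D"}.get(covered, "F")
    return covered, gap_pct, grade

print(coverage_gap({"chatgpt": True, "gemini": True, "claude": True}))  # -> (3, 40.0, 'C')
```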

The implications are concrete: if your client is invisible on Perplexity (15–20M US searches/month) and weak on Claude (45M+ searches/month), they're missing a combined 60M+ monthly search opportunities. That's your sales pitch to the client—not just "your visibility is low," but "you're missing 60M monthly opportunities across two engines."

Fix Priority: 1. Audit coverage gap (which engine is worst?). 2. Test 5 high-intent competitor queries in gap engine to reverse-engineer source preferences. 3. Develop engine-specific strategy. Evidence: GenPicked Engine Traffic Estimates (2026)

Dimension 4: Reddit Citation Footprint

Measure what percentage of your client's citations come from Reddit across all 5 engines. Critical for Perplexity: 46.7% of Perplexity's top citations are Reddit posts per Discovered Labs. This dimension matters because Reddit is where community conversations happen—and AI engines are trained to pull from those conversations. Grading: A=15–25% (healthy Reddit presence, not over-dependent), B=25–35% (moderate), C=35–50% (heavy), D=50%+ (dangerously dependent on Reddit volatility), F=0% (missing Perplexity entirely).
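
If you collect the source URLs during query testing, the share is a one-function calculation. A minimal sketch; note the rubric above leaves the 1–14% band undefined, so the sketch grades it C as a stated assumption.

```python
# Hypothetical sketch: Reddit share of citations, graded with the bands above.
from urllib.parse import urlparse

def reddit_share(citation_urls):
    """citation_urls: source URLs collected from AI responses."""
    def is_reddit(u):
        host = urlparse(u).netloc.lower()
        return host == "reddit.com" or host.endswith(".reddit.com")
    if not citation_urls:
        return 0.0
    return 100 * sum(1 for u in citation_urls if is_reddit(u)) / len(citation_urls)

def dimension_4_grade(share_pct):
    if share_pct == 0:  return "F"   # missing Perplexity's dominant source entirely
    if share_pct >= 50: return "D"   # dangerously dependent on Reddit volatility
    if share_pct >= 35: return "C"
    if share_pct >= 25: return "B"
    if share_pct >= 15: return "A"   # healthy presence without over-dependence
    return "C"                       # 1-14%: under-indexed; band not defined in the rubric
```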

Fix Priority: 1. Audit Reddit mention rate in relevant communities. 2. If Perplexity gap is high, build modest Reddit engagement strategy (not viral—just quality community participation). 3. Monitor Reddit sentiment drift quarterly. Evidence: Discovered Labs Perplexity Study (2026), Growth Marshal (2026)

Dimension 5: Schema Markup Audit for AI Engines

Audit whether product/review/FAQ/organization pages use attribute-rich structured data. Test whether schema appears in Google AI Overview responses. Grading: A=90–100% of eligible pages have schema visible in Overviews, B=75–89%, C=60–74%, D=40–59%, F=below 40%. Run pages through Google's Rich Results Test and test sample pages in Google AI Overview preview.

Important caveat: generic schema performs worse than no schema; only attribute-rich schema (Product, Review, FAQ with pricing/ratings/specs) moves the needle, per BrightEdge. A page ranking #1 with a vague Organization stub gets fewer citations than a page ranking #3 with detailed Product schema. Schema quality matters more than schema presence.
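
For reference, here is a sketch of what "attribute-rich" means in practice: a Product block carrying price, availability, and rating attributes rather than a bare name. The values are placeholders, not a recommendation for any specific client.

```python
# Hypothetical example of attribute-rich Product schema (all values are placeholders).
# Embed the JSON output in a <script type="application/ld+json"> tag in the page head.
import json

attribute_rich_product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Analytics Suite",
    "description": "Portfolio-level AI visibility reporting for agencies.",
    "brand": {"@type": "Brand", "name": "Example Client"},
    "offers": {
        "@type": "Offer",
        "price": "99.00",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock",
    },
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": "4.7",
        "reviewCount": "212",
    },
}

print(json.dumps(attribute_rich_product, indent=2))
```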

Fix Priority: 1. Inventory all product/service/review/FAQ pages. 2. Prioritize top 20 revenue-generating or lead-generating pages. 3. Add or fix schema markup on highest-priority pages. 4. Test in Google AI Overview preview within 2 weeks. Evidence: BrightEdge AI Overview Analysis (2026), Schema.org (2026)

Dimension 6: Domain Authority & Earned Brand-Mention Volume

Combine two metrics: (1) Domain Authority (Ahrefs/Moz), (2) monthly volume of earned brand mentions across trusted sources (news outlets, industry publications, academic journals, authority blogs). Grading: A=DA 50+, 100+ mentions/month from authority sources; B=DA 40–49, 50–100 mentions; C=DA 30–39, 20–50 mentions; D=DA 20–29, below 20 mentions; F=DA below 20.

Brand mentions correlate 0.664 with AI visibility vs 0.218 for backlinks per RivalHound—a 3:1 advantage for mentions. This is the single largest effect driver in the entire audit framework. If you have bandwidth for one lever, this is it. Domain authority acts as a trustworthiness multiplier in AI responses.

Fix Priority: 1. Run current DA and earned-mention audit (use Conductor or Ahrefs). 2. Identify 5–10 trusted authority outlets your buyers read. 3. Develop PR/earned media strategy targeting those outlets. 4. Monitor monthly earned mention volume. Evidence: Ahrefs DA & AI Visibility (2026), RivalHound Earned Mentions Study (2026)

Dimension 7: Content Chunking & AI Citation Hygiene

Measure the average "citation chunk" size (word count of content snippets AI engines pull). Optimal: 50–150 words. Audit structure: headers, lists, paragraphs vs dense walls of text. Grading: A=80%+ of pages in optimal range, B=60–79%, C=40–59%, D=20–39%, F=below 20%. Test pages manually by querying ChatGPT and measuring which sections it cites.

Content chunks of 50–150 words are cited by AI at 3.2× the rate of longer passages per Seer Interactive. Shorter is not always better; denser is not smarter. Optimal extractability wins. Use this as a content rewrite trigger for your lowest-graded pages.
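
A rough way to audit chunk size without a tool is to split the page on its headings and count words per section. A minimal sketch, assuming markdown-style "#" headings; the grade cutoffs mirror the bands above.

```python
# Hypothetical sketch: flag sections outside the 50-150 word extractability window.
import re

def chunk_report(page_text):
    """page_text: plain text with markdown-style '#' headings marking sections."""
    sections = [s.strip() for s in re.split(r"\n#{1,6} ", "\n" + page_text) if s.strip()]
    report = []
    for s in sections:
        words = len(s.split())
        status = "ok" if 50 <= words <= 150 else ("too short" if words < 50 else "too long")
        report.append((s.splitlines()[0][:40], words, status))
    return report

def dimension_7_grade(report):
    pct = 100 * sum(1 for _, _, status in report if status == "ok") / len(report) if report else 0
    for cutoff, grade in ((80, "A"), (60, "B"), (40, "C"), (20, "D")):
        if pct >= cutoff:
            return grade
    return "F"
```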

Fix Priority: 1. Audit top 10 client pages; measure citation chunk sizes. 2. Rewrite pages with sub-optimal chunking. 3. Add strategic headers/lists to break up long-form content. 4. Re-test in AI engines within 4 weeks. Evidence: Seer Interactive Content Structure Study (2026), Frase AI Optimization Guide (2026)

Dimension 8: GA4 Attribution & AI Traffic Misclassification

Audit GA4 traffic classification. 60%+ of ChatGPT traffic is misclassified as "Direct" in GA4 per SE Journal. Grading: A=0–10% gap (GA4 correctly tracks 90%+ AI traffic), B=10–25%, C=25–50%, D=50–75%, F=75%+ gap (GA4 captures below 25%).

This creates a false narrative for client reporting: strong AI citation rates but no visible traffic lift in GA4. Agencies under-invest in AEO as a result. Start by building a custom GA4 segment for suspected AI traffic and monitoring Direct traffic spikes in weeks when you know AI citations are high.

Fix Priority: 1. Create custom GA4 segment for AI traffic. 2. Compare Direct traffic spikes to citation increases (manual check). 3. Set up referrer-based detection for known AI domains. 4. Quantify citation-vs-GA4 gap and report it to the client. Evidence: SE Journal GA4 Attribution Analysis (2026)
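
For step 3, here is a hedged sketch of what referrer-based detection can look like. The domain list is illustrative, not exhaustive, and needs maintaining as engines change their referrer hostnames.

```python
# Hypothetical sketch: classify a session's referrer as AI, other, or direct.
from urllib.parse import urlparse

AI_REFERRER_DOMAINS = {
    "chatgpt.com", "chat.openai.com",   # ChatGPT
    "perplexity.ai",                    # Perplexity
    "gemini.google.com",                # Gemini
    "claude.ai",                        # Claude
    "copilot.microsoft.com",            # worth tracking alongside the five engines
}

def classify_referrer(referrer_url):
    if not referrer_url:
        return "direct"   # where most misclassified AI traffic hides
    host = urlparse(referrer_url).netloc.lower()
    if any(host == d or host.endswith("." + d) for d in AI_REFERRER_DOMAINS):
        return "ai"
    return "other"

print(classify_referrer("https://www.perplexity.ai/"))  # -> 'ai'
```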

Dimension 9: Google AI Overviews Coverage & Tracked Query Visibility

Run 50–100 target keywords. For each, check if a Google AI Overview appears. If yes, does it mention the client? Calculate "Overview Appearance Rate" and "Citation Rate Within Overview." Grading: A=80%+ appearance, 70%+ citation; B=60–79% appearance; C=40–59%; D=20–39%; F=below 20%.

Google AI Overviews appear on 50–60% of queries per Conductor and are growing. Citations in Overviews are relatively stable compared to ChatGPT/Perplexity volatility, making this dimension valuable for long-term strategy.

Fix Priority: 1. Audit current Overview appearance and citation rates. 2. Identify query clusters with low citation. 3. Optimize schema markup for Overview-heavy query types. 4. Improve content structure (headers, lists, definitions). Evidence: Conductor AI Overview Metrics (2026), Google Search Central (2026)

Dimension 10: Sentiment & Perception Drift

Analyze how AI engines describe the client's brand in responses. Extract frequently paired adjectives ("premium," "affordable," "innovative"). Compare brand descriptors across engines and against competitors. Grading: A=consistent, positive descriptors, 80%+ alignment across engines; B=60–79% alignment; C=40–59%; D=below 40%; F=negative or irrelevant descriptors.

Fix Priority: 1. Conduct descriptor audit across all 5 engines. 2. Identify negative or misaligned descriptors. 3. Trace descriptor sources to identify where perception gap originated. 4. Build strategy to boost positive descriptor sources. Evidence: ZipTie Brand Perception Study (2026)

Dimension 11: Competitor Citation Overlap & Query Loss Analysis

41% of queries where you rank but don't get cited are won by 2–3 specific competitors per RivalHound. Audit 30–50 high-intent keywords. Map which competitors are cited when your client is absent. Grading: A=10% overlap loss; B=20%; C=30–40%; D=40–60%; F=60%+.
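
The tally is simple once you've recorded, for each keyword, whether the client was cited and which competitors were. A minimal sketch with illustrative field names:

```python
# Hypothetical sketch: overlap loss % and the competitors who win those queries.
from collections import Counter

def overlap_loss(query_results):
    """query_results: list of dicts like
    {"query": "...", "client_cited": False, "competitors_cited": ["Acme", "Globex"]}"""
    lost = [r for r in query_results if not r["client_cited"]]
    loss_pct = 100 * len(lost) / len(query_results) if query_results else 0
    winners = Counter(c for r in lost for c in r["competitors_cited"])
    return round(loss_pct, 1), winners.most_common(3)
```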

This dimension reveals the exact competitive gap. If the same 2–3 competitors consistently beat you in AI responses across multiple queries, reverse-engineering their content, schema, authority signals, and positioning tells you the exact gaps to close.

Fix Priority: 1. Identify top 3 citation-winning competitors. 2. Reverse-engineer #1 winner (compare content length, structure, schema richness, domain authority, earned mentions). 3. Close largest gap. 4. Re-test in AI engines in 6–8 weeks. Evidence: RivalHound Competitive Loss Analysis (2026)

Dimension 12: Alert Hygiene & Monitoring Configuration

Set up alerts for changes impacting AI visibility: brand mentions, competitor content launches, schema errors, ranking shifts, citation rate changes. Grading: A=6+ alert types configured, weekly reviews, <48hr response SLA; B=4–5 alert types, bi-weekly reviews, 3-day SLA; C=3–4 alert types, monthly reviews; D=1–2 alert types, ad-hoc; F=no alerts.

AEO is volatile. A brand can shift from 80% citation rate to 40% in 2–4 weeks if key content is removed or competitor content goes viral. Monthly audits are too slow. Weekly snapshot testing + alert-driven response is the minimum operational cadence.

Fix Priority: 1. Select 2–3 highest-impact alert types (brand mentions, citation rate drops, competitor new content). 2. Configure tools and set review cadence. 3. Document alert escalation protocol. 4. Add remaining alert types iteratively. Evidence: SE Ranking Monitoring Best Practices (2026)

How to grade, composite, and report the audit

Run all 12 dimensions on each client. Assign each an A-F grade. Weight them by measured impact: Dimensions 1, 6, and 11 carry a weight of 15 each (they drive the most citations), Dimensions 2–5 and 7–10 a weight of 10 each, and Dimension 12 a weight of 5; normalize the weights so they sum to 100% before combining. Calculate the composite on a 0–100 scale where A=90–100, B=80–89, C=70–79, D=60–69, F=below 60.
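
A minimal scoring sketch, assuming each letter grade maps to the midpoint of its band (F has no floor, so it is pinned at an assumed 30) and that the relative weights above are normalized so the composite stays on a 0–100 scale.

```python
# Hypothetical sketch: composite ACS from 12 letter grades.
GRADE_POINTS = {"A": 95, "B": 84.5, "C": 74.5, "D": 64.5, "F": 30}  # F floor is an assumption

WEIGHTS = {dim: 10 for dim in range(1, 13)}      # Dimensions 2-5 and 7-10
WEIGHTS.update({1: 15, 6: 15, 11: 15, 12: 5})    # heaviest drivers, plus alert hygiene

def composite_acs(grades):
    """grades: dict mapping dimension number (1-12) to a letter grade."""
    total_weight = sum(WEIGHTS[d] for d in grades)
    return sum(GRADE_POINTS[g] * WEIGHTS[d] for d, g in grades.items()) / total_weight

print(round(composite_acs({d: "B" for d in range(1, 13)}), 1))  # -> 84.5
```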

A composite above 75 is strong for competitive markets. Below 50 means urgent work needed before the next QBR. Track this score month-over-month and use it as your primary client-reporting metric.

Report this to the client in business terms, not audit jargon. Instead of "Your Domain Authority is 38 and you have 22 earned mentions per month," say: "Your brand is recommended in 6 of 10 AI responses on high-intent queries. Your closest competitor appears in 8 of 10. We found 5 authority publications that cite your competitor but not you—fixing that is our #1 Q3 initiative because each of those earned mentions typically drives 15–25 qualified leads per month."

Do this

Draft your first audit report for your worst-performing client this week. Use the 12 dimensions. Focus your first round of fixes on the 3 dimensions with the biggest gaps. Those are your Q3 initiatives. The client will see measurable progress in 4–6 weeks if you prioritize the highest-effect levers (earned mentions, engine coverage, schema). Use that progress in the next quarterly business review to justify continued or expanded investment.

What doesn't belong in the audit (the vendor trap)

llms.txt does not work. SE Ranking tested 300,000 domains and found zero correlation between llms.txt and AI citations. The file is a red herring. If a vendor pitches llms.txt as their #1 lever, ask what else they have.

Single-engine dashboards hide the real picture. A blended "AI visibility score" averaged across all engines tells you nothing actionable. Claude cites brands at 97.3%, ChatGPT at 73.6%—the spread means an averaged number is useless for strategy. Always split results by engine.

Generic schema is worse than no schema. Pages with generic schema were cited at 41.6%, vs 59.8% for no schema, and 61.7% for attribute-rich schema per Growth Marshal. Invest in specific, detailed schema only. Copy-pasting boilerplate JSON-LD is a waste of time.

Start your 14-day free trial

Growth plan free for 14 days. Five AI engines. Full agency dashboard.

Start free trial

Joseph K. Banda

Co-Founder, GenPicked

Building the AEO platform for marketing agencies. Helping agency owners get their clients cited by ChatGPT, Perplexity, Gemini, Claude, and Google AI Overviews — and prove it with data.

Credentials:

Co-Founder, GenPicked, AEO Citation Score (ACS) framework architect, GenPicked Agency Audit (Q2 2026)

Frequently Asked Questions

What is an AEO audit?

An AEO (Answer Engine Optimization) audit measures whether your client's brand is cited by ChatGPT, Perplexity, Gemini, Claude, and Google AI Overviews across a set of target queries. It grades citation frequency, engine-specific visibility, schema markup quality, domain authority, earned mentions, content structure, and competitive positioning. The audit produces a 0-100 ACS (AEO Citation Score) and a prioritized list of fixes. It differs from SEO audits because AI engines and Google rank differently—a brand can be #1 on Google but invisible to ChatGPT, or vice versa.

How often should agencies run AEO audits on their clients?

Run a full audit quarterly (every 13 weeks). Run a snapshot audit (citation rate test + Overview check) monthly. Run emergency audits anytime there's positive news about the client, a competitor launch, or an unexpected traffic anomaly. AEO is volatile—citation rates can shift 20–30% in 2–4 weeks if a competitor publishes viral content or key content is removed. Monthly tracking is the minimum for risk management.

What's the most important AEO audit dimension?

Dimension 6: Domain Authority & Earned Brand-Mention Volume. Per RivalHound, brand mentions in trusted sources correlate 0.664 with AI visibility, while traditional backlinks correlate only 0.218—a 3:1 advantage for mentions. If you have bandwidth for one lever before the next QBR, invest here: identify 5–10 trusted authority outlets your prospects read, develop a PR/earned media strategy, and track monthly mention volume. The ROI on this single dimension outweighs schema fixes, content rewrites, and Reddit strategy combined.

Should AEO audits split by engine or use a single composite score?

Always split by engine. A composite score averaged across ChatGPT, Perplexity, Gemini, Claude, and Google AI Overviews hides the real picture. Per Profound, Claude cites brands at 97.3%, ChatGPT at 73.6%, Perplexity at 54.8%—a 43-point spread. The same brand can be category-leader in Claude but invisible in Perplexity. Your audit report and your client reporting must show each engine separately, with engine-specific strategy recommendations.

Does FAQ schema actually move the audit grade?

Yes, but less than vendor marketing claims. Per Frase, pages with FAQPage markup are 3.2× more likely to appear in Google AI Overviews. Per AI Boost, pages with FAQ schema plus inline citations are weighted 40% higher in ChatGPT source selection. However, per ZipTie, domain authority outweighs schema by 3.5:1 in overall citation probability. And per Growth Marshal, generic, copy-paste schema performs worse than no schema at all. Invest in attribute-rich Product, Review, and FAQ schema only—not generic Article schema.

How do I grade Reddit footprint without spending hours per client?

Use the simple method: (1) Run 25 high-intent queries in Perplexity. (2) For each response, count how many of the top 3 sources are Reddit posts. (3) Calculate Reddit citation %. Per Discovered Labs, 46.7% of Perplexity's top citations are Reddit. If your client is at 10%, they're under-indexed on Perplexity. If they're at 50%+, they're dangerously dependent on Reddit volatility. Most clients land in the 20–35% range. Assign a grade (A–F) based on the 12-audit framework and move on. Don't spend hours manually monitoring—use this as a quarterly benchmark.

How does AEO audit grading differ from a traditional SEO audit?

SEO audits measure ranking position and backlink profile. AEO audits measure citation rate and source diversity. SEO audits assume Google and traditional search. AEO audits assume 94% of B2B buyers use AI during purchase (per 6sense). In SEO, ranking #1 is the win. In AEO, being cited in the response matters more than your Google rank. A page at position 11 in Google that gets cited in 8 of 10 ChatGPT responses is outperforming a page at position 1 that gets cited in 0 of 10. They're separate ranking systems with separate measurement frameworks.

Can I white-label the audit report for my client deliverables?

If you use GenPicked's platform, yes—the white-label feature (available on Growth and Scale plans) produces agency-branded reports showing the 12 audit dimensions, composite ACS score, engine-by-engine breakdown, competitive positioning, and recommended fixes. The report can include your agency logo and color scheme. For manual audits, you'll need to create your own template, but the 12-dimension framework translates directly into a client-ready scorecard format. Either way, report in business language, not audit jargon.

What grade should I aim for?

Composite score above 75 is strong. 50–74 is competitive but needs work. Below 50 means urgent improvement needed before the next quarterly business review. For individual dimensions, aim for B or better (75+ range). If a client has an A in Domain Authority but a D in Schema Markup, you have a clear priority: fix schema first (it's the low-hanging fruit). Track month-over-month improvement; even a 5–10 point ACS increase is worth reporting because it demonstrates progress.

What if my client scores low across the board—where do I start?

Prioritize in this order: (1) Domain Authority & Earned Mentions (Dimension 6)—this is 3:1 more impactful than schema. (2) Engine Coverage Gap (Dimension 3)—identify which of the 5 engines has zero presence and reverse-engineer that engine's source preferences. (3) Schema Markup (Dimension 5)—fix the 5–10 highest-revenue pages with attribute-rich Product/Review schema. (4) Content Chunking (Dimension 7)—restructure one pillar page to 50–150 word sections with Q&A headings. You can show measurable progress in 4 weeks with just these four. Use that momentum for the bigger earned-mention strategy in weeks 5–12.

Get Your Brand's AEO Score

See how your brand is performing in AI search with our free AEO audit.

Start Your Free Audit
#aeo #ai-visibility #agency-playbook #audit #answer-engine-optimization #agency-ops #client-reporting