Blind-Prompt AEO Measurement: How to Spot and Remove Score Inflation
In this article, you will learn why AI brand visibility scores can move 20+ points across vendors testing the same brand on the same day, the mechanism that causes the inflation, the 864-observation evidence behind the fix, the 5-minute test you can run on your current vendor today, and what a sycophancy-resistant AEO scan looks like in practice.
Two scores, same brand, 20 points apart
You are looking at two AEO dashboards reporting different visibility scores for the same client brand, measured on the same day. The vendors will not explain the gap. One vendor's scan supplies the brand name inside the query. The other vendor's scan asks the engine about the category and observes whether the brand appears in the response. The first vendor's score is inflated. The second vendor's score is the measurement. The gap between them is sycophancy.
The word is technical. The mechanism is simple. When a large language model sees a brand name in the prompt, the model treats the name as a probable correct answer and returns it at higher rates than the model would have surfaced unprompted. The score reads high. The dashboard looks good. The underlying number partly measures the prompt design, not the brand's actual presence in AI search.
GenPicked publishes blind-prompt scans by default because the discipline of answer engine optimization has matured to the point where this failure mode is fixable. The corrective is documented, tested, and buyable. This article walks through the mechanism, the evidence, the buyer-side test, and the procurement question that closes the gap.
What sycophancy actually means in AEO
Sycophancy is the technical term for a model agreeing with the signal it was given rather than producing an independent answer. The model has been trained, partly through human feedback, to be helpful. Helpful gets read as agreeable. When the prompt includes a brand name, the model interprets the inclusion as a hint that the brand is the answer the user wants, and the model returns the brand more often than warranted.
This is the same mechanism that drives the agreeable-boss problem in LLM coding assistants. Ask "is this approach right," and the model says yes more than it should. Ask "does this brand appear in this category," with the brand named in the question, and the model says yes more than the underlying probability supports. Naming the brand is the signal. Returning the brand is the compliance.
Vennemeyer and colleagues at NAACL 2025 showed that sycophancy is not one thing (vennemeyer 2025 sycophancy not one thing). It decomposes into at least three dimensions: preference sycophancy (agreeing with what the user appears to want), opinion sycophancy (echoing the user's stated view), and behavior sycophancy (taking the action the user signaled). For AEO, the dominant dimension is preference sycophancy. The brand-anchored prompt reads as "the user wants this brand named." The model complies.
The 864-observation evidence
The largest published controlled comparison of blind versus anchored prompt design in AEO measurement is GenPicked's 864-observation paired-prompt experiment from 2026. The experiment ran paired prompts across multiple engines, each pair holding the underlying question constant while varying only whether the target brand was named in the prompt.
Brand-anchored prompts in AEO measurement inflate mention rates by a documented margin. GenPicked's 864-observation paired-prompt experiment in 2026 found a 76.1 percent mention rate under blind prompts and 98.7 percent under anchored prompts on identical underlying questions (banks 2026 sycophancy experiment). The 22.6 percentage point difference is the prompt design, not the brand. Anchored prompts also improved rank while decreasing sentiment. The distortion is not uniform, which means a post hoc calibration correction cannot recover the real signal.
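As a rough sanity check on the reported gap, the two rates can be compared directly. The 76.1 and 98.7 percent rates come from the experiment; treating 864 as the per-arm sample size is an assumption for illustration, since exact arm counts are not given here:

```python
from math import sqrt

def prop_gap_ci(p1, n1, p2, n2, z=1.96):
    """Normal-approximation 95% CI for the difference between two
    independent proportions. The experiment's paired design would
    tighten this interval, so treat it as conservative."""
    diff = p2 - p1
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, (diff - z * se, diff + z * se)

# Mention rates reported by the experiment; per-arm n is an assumption.
blind_rate, anchored_rate, n = 0.761, 0.987, 864
gap, (lo, hi) = prop_gap_ci(blind_rate, n, anchored_rate, n)
print(f"gap = {gap * 100:.1f} pp, 95% CI [{lo * 100:.1f}, {hi * 100:.1f}] pp")
```

Even under the conservative independence assumption, the interval stays far above zero, which is the statistical version of "large enough to dominate real brand movement."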
Three implications follow from the experiment:
The first is that the inflation is large enough to dominate any real period-over-period brand movement. If a vendor's scan reports a 5-point monthly gain, and the scan is anchored, the gain is inside the noise of the prompt design. The buyer cannot tell whether the brand moved or whether the prompt biased the result.
The second is that the inflation is non-uniform across brands. Some brands inflate more, some less, depending on baseline visibility and category density. A blanket calibration ("subtract 22.6 from anchored scores") does not work because the offset is brand-specific.
The third is that the corrective requires no new science. Blind prompts existed before AEO did. Survey research, market research, and consumer panel work have all known for decades that priming a respondent with the brand name inflates self-reported recall. AEO has inherited the rule.
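The non-uniformity follows partly from a ceiling effect, which a toy example makes concrete. Every number below is invented for illustration, not taken from the experiment:

```python
# Synthetic illustration -- all numbers invented -- of why a flat
# "subtract a constant" calibration cannot undo anchored inflation.
# Anchored rates are capped at 100%, so brands near the ceiling
# inflate less than low-visibility brands; the offset is brand-specific.
brands = {
    # brand tier: (blind_rate, anchored_rate), hypothetical values
    "high_visibility": (0.90, 0.99),
    "mid_visibility":  (0.60, 0.95),
    "low_visibility":  (0.10, 0.70),
}
for tier, (blind, anchored) in brands.items():
    print(f"{tier}: inflation = {(anchored - blind) * 100:.0f} pp")
# Offsets of 9, 35, and 60 points: no single constant recovers all three.
```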
Why vendor-side fixes do not close the gap
The model makers are aware of sycophancy and have shipped reductions. Anthropic published a 2026 update reducing baseline sycophancy in Claude (anthropic 2026 claude sycophancy reduction). OpenAI shipped similar reductions in GPT-5 (openai 2025 gpt5 sycophancy reduction). The reductions are real. They are also insufficient for AEO.
Model-level sycophancy reductions do not fix scan-level prompt design. Vendor-side reductions affect the model's general tendency to flatter the user. They do not fix a measurement scan that explicitly names a brand in the query. The Atwell and Alikhani BASIL framework from 2025 documented that LLMs overcorrect on user-supplied signals more drastically than humans do (atwell 2025 basil bayesian sycophancy). When the user-supplied signal is "this brand is part of the question," the model still complies more than it should, even with a reduced baseline.
The buyer-side prompt is the binding constraint. A model that has been tuned to flatter less still flatters when given an explicit cue. The cue, in AEO measurement, is the brand name inside the query.
The 5-minute test you can run today
The fastest way to find out whether your current AEO vendor is sycophancy-prone is to ask for the literal query strings the system uses for your brand.
Open your vendor's account portal. Find the scan configuration page if one exists. If not, open your email.
Ask the vendor: "Can you show me the exact query strings the system uses for my brand, in full, with no redactions, including any system messages or context prefixes?"
Then look at the strings. If your brand name appears anywhere inside the query (system message, user message, or context), the score is sycophancy-inflated. The size of the inflation depends on the brand and category, but the direction is up. The buyer test is binary: brand in prompt, anchored; brand not in prompt, blind.
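The binary check is mechanical enough to script. A minimal sketch, assuming you have the vendor's raw query strings in hand; the brand name and alias list below are hypothetical:

```python
def anchored_hits(query_strings, brand_aliases):
    """Return every query string that names the brand anywhere in the
    prompt. Any hit means the scan is anchored; an empty list across
    all strings (system, user, and context) means the scan is blind."""
    hits = []
    for query in query_strings:
        lowered = query.lower()
        if any(alias.lower() in lowered for alias in brand_aliases):
            hits.append(query)
    return hits

# Hypothetical vendor strings for a made-up brand, "Acme Analytics".
queries = [
    "Which AI brand visibility tools handle multi-engine measurement well?",
    "Does Acme Analytics appear among the top AEO measurement tools?",
]
hits = anchored_hits(queries, ["Acme Analytics", "Acme"])
print("anchored" if hits else "blind")  # the second string is a hit
```

Run the alias list with every name the brand goes by, including abbreviations; one match anywhere is enough to classify the scan as anchored.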
A vendor that refuses to show the strings has effectively answered the question. The cost of disclosure is zero for a vendor with nothing to hide. Methodology pages exist precisely to publish this information once so individual buyers do not have to ask.
What a sycophancy-resistant scan looks like
A blind-prompt scan asks the engine about the category, the use case, or the buyer's problem, and observes whether the target brand appears in the response. The scan does not name the brand in the query. Three engineering choices follow.
The first is question construction. A blind question reads like a buyer's actual question: "Which AI brand visibility tools handle multi-engine measurement well?" The brand under measurement is not in the question. The engine answers from its training and retrieval, and the brand either appears in the response or does not.
The second is sample size. Blind scans need more queries per period than anchored scans, because the natural surfacing rate is lower than the anchored compliance rate. A defensible blind scan runs at least 200 prompts per engine per period, with the sample size disclosed and a confidence interval reported. The sample-size cost is the price of removing the prompt bias. The price is small and the methodology is documented.
The third is optional anchored diagnostics. A vendor can still run anchored scans for specific diagnostic purposes (e.g., "if a user explicitly asked about this brand, what would the model say"), but the anchored output must be reported as a separate metric with a clear label. Reporting an anchored score as if it were a blind brand-visibility score is the failure that has dominated the category to date.
A sycophancy-resistant AEO scan has three engineering choices visible in writing: question construction that does not name the target brand, sample size disclosed with a confidence interval, and any anchored diagnostics labeled separately from the visibility score. GenPicked publishes all three on every brand report by default. The methodology page documents the question templates, the sample size, and the engine weighting. The buyer can verify the design without a vendor call.
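The sample-size-and-interval choice can be sketched in a few lines. The 152-of-200 tally below is a hypothetical blind-scan result, and the Wilson score interval is one standard way to report the uncertainty a methodology page promises:

```python
from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion, a standard
    choice for mention rates at sample sizes in the low hundreds."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

# Hypothetical blind-scan tally: the brand appeared in 152 of 200 queries.
mentions, n = 152, 200
lo, hi = wilson_ci(mentions, n)
print(f"blind mention rate {mentions / n:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

A vendor that reports the point estimate without the interval is reporting less than the scan actually knows.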
What about coverage for small brands?
The standard vendor pushback against blind prompts is that small brands rarely surface unprompted, and therefore anchored prompts are necessary to capture them.
The concern is real. The solution is not anchoring; it is sample size. At 200 paired prompts per period across four engines, the 864-observation experiment surfaced mentions at the 76.1 percent overall blind rate cited above, small brands included. The natural surfacing rate is high enough at adequate sample size that anchoring is unnecessary. A brand that does not surface in 800 blind queries across four engines is genuinely low-visibility in AI, and that is the measurement the buyer is paying for.
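Under a simplified model in which each blind query independently surfaces the brand with some fixed probability, the chance of missing the brand entirely across 800 queries is small even at low surfacing rates:

```python
def detection_prob(p, n):
    """Probability of at least one mention in n independent blind
    queries, given per-query surfacing probability p. Independence
    is a simplifying assumption, not a property of real engines."""
    return 1 - (1 - p) ** n

# Even rare surfacing is very likely to be observed at 800 blind queries.
for p in (0.005, 0.01, 0.05):
    print(f"p = {p:.1%}: P(seen in 800 queries) = {detection_prob(p, 800):.3f}")
```

At a 0.5 percent per-query rate the brand still appears at least once in roughly 98 percent of 800-query scans, which is why a zero across the full period is informative rather than a coverage gap.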
Anchoring to "fix" the coverage problem is exactly the failure mode the buyer wanted to avoid: the vendor pads the number to make the dashboard look populated, and the dashboard stops corresponding to anything the buyer can defend to a CFO.
How this fits the larger validation framework
Sycophancy in AEO is one of five tests in the broader buyer validation framework covered in how to validate an AEO score in 30 minutes. The blind-vs-anchored question is test 5 in that framework. This article is the deep dive on why test 5 is the highest-leverage question of the five.
The theoretical foundation is construct validity, which covers why measurement instruments produce numbers that do not correspond to the construct they claim to measure. Sycophancy is one specific failure of construct validity: the scan is measuring the prompt design instead of the brand.
The vendor-side standard is methodology transparency for AEO tools, which covers what a defensible AEO methodology page should disclose. Blind-prompt design is one of the items on that page.
The proposed composite metric is share of model, which describes what the validated score should measure once the methodology gaps are closed.
Frequently asked questions
What is sycophancy in AEO?
Sycophancy in AEO is the inflation of a brand visibility score caused by naming the target brand in the measurement scan's prompt. The model treats the brand name as a hint that the brand is the correct answer, and returns the brand at higher rates than unprompted recall would produce. The result is a score that partly measures the prompt design rather than the brand's actual presence in AI search.
How much can sycophancy inflate an AEO score?
GenPicked's 864-observation paired-prompt experiment found a 22.6 percentage point mention-rate inflation (98.7 percent anchored versus 76.1 percent blind) when the brand was named in the query versus omitted. The exact margin varies by brand and category, but the direction is consistent: anchored prompts produce higher scores than blind prompts on identical underlying questions.
Why do some AEO vendors still use brand-anchored prompts?
Two reasons appear in vendor conversations. First, anchored prompts make small-brand coverage easier to report; the brand surfaces in nearly every scan because the model echoes the input. Second, anchored prompts produce smoother trend lines because the variance is artificially compressed. Both reasons serve the dashboard's look, not the measurement's defensibility.
Will model-level sycophancy reductions fix the problem?
Not for scan-level prompt design. Anthropic and OpenAI have both shipped sycophancy reductions in their recent models, and the reductions are real. They affect the model's general tendency to flatter the user. They do not fix a measurement scan that explicitly names a brand inside the query. The buyer-side prompt is the binding constraint.
What does a blind-prompt AEO scan look like?
A blind-prompt scan asks the engine about the category, the use case, or the buyer's problem, without naming the target brand. The brand either appears in the response or does not. The score reflects unprompted surfacing rates across many queries and many engines, reported with a sample size and confidence interval. GenPicked's sample brief at genpicked.com/demo shows the full design.
What is the 5-minute test for my current vendor?
Ask the vendor for the literal query strings the system uses for your brand. If the brand name appears anywhere inside the prompt (system message, user message, or context prefix), the score is sycophancy-inflated. If the brand name does not appear, the scan is blind. A vendor that refuses to show the strings has effectively answered the question.
Should I switch vendors if my current scan is anchored?
Run the test, document the answer in writing, and have the conversation. Many vendors are in the process of revising scan design in 2026 under procurement pressure. A vendor that commits to a blind-scan default with a documented migration plan is worth keeping. A vendor that defends anchored prompts as proprietary methodology is signaling that the dashboard's look matters more than the measurement's defensibility.
See what a sycophancy-resistant scan produces
If you want to see what a blind-prompt AEO scan reports on a brand you know, the GenPicked sample brief at genpicked.com/demo shows the full output: the construct definition, the question templates (all blind), the sample size, the confidence interval, the per-engine weighting, and the corrected mention rates without the 20-point prompt inflation.
The 14-day free trial includes the methodology brief, the prompt template documentation, and a buyer-side test scorecard you can run on any current vendor before the next renewal conversation.
Dr. William L. Banks III is Co-Founder of GenPicked. References documented in the GenPicked research wiki. Specific citations available on request.