The Procurement Checklist for AI Visibility Vendors: 13 Questions to Demand in Writing

In this article, you will learn what to put on a procurement checklist before signing an AI visibility vendor, which questions separate measurement-grade tools from dashboard wrappers, what evidence a credible vendor produces on request, and which answers should end the meeting.


Why this checklist exists

Twenty-seven vendors are competing for AI visibility budget as of mid-2026. Profound has been valued at one billion dollars, according to Fortune. AthenaHQ raised 2.2 million dollars in seed funding in June 2025. The category is moving fast, and most procurement teams are evaluating it for the first time. Buyers are asking the wrong questions because the right questions have not been written down anywhere in one place.

This is that place. Below is a 13-item checklist a CMO or agency owner can hand to a vendor before a renewal, before a first contract, or before an RFP closes. Every item asks for a specific answer in writing. Every item names the red flag answer that should end the conversation. The checklist is methodology-first because the underlying signal is noisy. Fishkin and O'Donnell ran 2,961 identical prompts through ChatGPT, Claude, and Google AI in early 2026 and found that fewer than one percent produced the same brand list (Fishkin and O'Donnell, 2026). A vendor that does not address that volatility in its methodology is selling a number, not a measurement.

The checklist sits alongside three deeper pieces. The argument for why methodology disclosure matters at all lives in our methodology transparency article. The mathematics behind defensible ranking lives in the pairwise ranking explainer. The definition of the underlying metric lives in our Share of Model primer. This piece is the operational layer on top of those three.


The 13-item due-diligence checklist

1. How do you turn raw AI responses into a ranking?

Ask the vendor to describe in plain prose how a single prompt run becomes a position on a leaderboard. The answer should distinguish absolute ranking (sort brands by mention count) from pairwise ranking (compare brands two at a time and aggregate the wins). It should name how the vendor mitigates position bias, because items in slots one through three of an AI response get attention that items in slots eight through ten do not (position-bias).

Right answer: A written description that names the ranking method, references how position effects are handled (counterbalancing, pairwise aggregation, or randomized slot ordering), and states the trade-offs of the chosen approach.

Red flag: "Our algorithm is proprietary." A measurement whose math cannot be described is a marketing number.
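
To make the distinction concrete, here is a minimal sketch, not any vendor's actual method, contrasting the two approaches over the same parsed responses. The brand names and the simple win-rate aggregation are illustrative assumptions; a production pairwise method would also randomize slot order and model uncertainty.

```python
# A minimal sketch (not any vendor's actual method) contrasting absolute
# ranking with pairwise aggregation over the same parsed AI responses.
from collections import defaultdict
from itertools import combinations

# Each entry is the ordered list of brands extracted from one response.
responses = [
    ["BrandA", "BrandB", "BrandC"],
    ["BrandB", "BrandA"],
    ["BrandC", "BrandA", "BrandB"],
]

# Absolute ranking: sort brands by raw mention count.
mentions = defaultdict(int)
for r in responses:
    for brand in r:
        mentions[brand] += 1
absolute = sorted(mentions, key=mentions.get, reverse=True)

# Pairwise ranking: within each response, the brand listed earlier "wins"
# the head-to-head; aggregate win rates across responses. A production
# method would also randomize slot order to counter position bias.
wins = defaultdict(int)
games = defaultdict(int)
for r in responses:
    for earlier, later in combinations(r, 2):
        wins[earlier] += 1
        games[earlier] += 1
        games[later] += 1
pairwise = sorted(games, key=lambda b: wins[b] / games[b], reverse=True)

print("absolute:", absolute)
print("pairwise:", pairwise)
```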

2. Are your measurement prompts blind to the target brand?

Naming a brand in the prompt inflates that brand's apparent visibility by a large and measurable margin (Atwell and Alikhani, 2025). A vendor that prompts the engine with the client's brand named in the question is producing a sycophancy artifact, not a visibility measurement.

A controlled study of named vs. blind prompt design across five frontier engines found that brand-named prompts inflated the target brand's apparent recommendation rate by over twenty percentage points compared to category-only prompts. The effect is large, systematic, and replicates across engines (Atwell and Alikhani, 2025).

Right answer: Prompts describe the buyer scenario and the category, never the target brand. The vendor can produce sample prompts on request. See blind versus named measurement for the underlying principle.

Red flag: "We prompt the engine with your brand and competitors to see who it picks." That is sycophancy on a dashboard.
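
The difference is easiest to see in the prompt text itself. The templates below are hypothetical illustrations of the two designs, not GenPicked's or any other vendor's actual wording.

```python
# Hypothetical prompt templates illustrating blind vs. named measurement.
# The category and brand are placeholders, not anyone's actual wording.
category = "CRM platforms for small sales teams"
target_brand = "ExampleCRM"  # the client being measured

# Blind: describes the buyer scenario and the category, never the brand.
blind_prompt = (
    f"We are a ten-person sales team choosing among {category}. "
    "Which products would you recommend, and why?"
)

# Named: primes the engine with the target brand -- the design the
# checklist flags, because it measures sycophancy rather than visibility.
named_prompt = (
    f"Is {target_brand} a good choice among {category}? "
    "How does it compare to its competitors?"
)
```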

3. Which AI engines do you query, and how are they weighted?

Different engines surface different brands. A score that uses only ChatGPT is reporting a ChatGPT-specific ranking. A composite score requires an explicit engine-weighting choice, and that choice is editorial, not neutral.

Right answer: A named list of engines (typically four or more, covering ChatGPT, Claude, Gemini, Perplexity), a stated weighting, and the rationale for the weighting tied to traffic share or buyer behavior data.

Red flag: Refusal to name the engines, refusal to disclose weights, or a weighting that has changed in the last 90 days without a documented changelog.
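
For illustration, a composite built from disclosed weights can be as simple as the sketch below. The engines, weights, and per-engine scores are placeholder values; the point is that the weighting is an explicit, auditable input rather than something buried inside the tool.

```python
# Illustrative composite built from disclosed engine weights. The engines,
# weights, and per-engine scores are placeholder values; the point is that
# the weighting is an explicit, auditable input.
engine_weights = {"chatgpt": 0.40, "gemini": 0.25, "perplexity": 0.20, "claude": 0.15}
assert abs(sum(engine_weights.values()) - 1.0) < 1e-9

per_engine_score = {"chatgpt": 0.31, "gemini": 0.18, "perplexity": 0.42, "claude": 0.22}

composite = sum(w * per_engine_score[e] for e, w in engine_weights.items())
print(round(composite, 3))  # 0.286
```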

4. What is the per-period sample size for each measurement?

Sample size determines whether a month-over-month rank change is signal or noise. A scan that runs three or five prompts has high variance because each prompt is a single noisy draw from a stochastic system.

Right answer: A specific number per pair (for pairwise designs) or per query (for absolute designs), reported in the dashboard or available on request. The realistic cost structure for a defensible measurement runs into tens of thousands of model calls per measurement period. Anything substantially below that is not statistically adequate. See the pairwise ranking explainer for the math behind the sample-size requirement.

Red flag: No stated sample size, or a sample size that varies week to week without disclosure.
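
A back-of-envelope calculation shows why small prompt counts cannot resolve close ranks: the margin of error on a per-pair win rate shrinks only with the square root of the number of comparisons. The numbers below are illustrative, not a prescription for any particular design.

```python
# Back-of-envelope check on why tiny prompt counts cannot resolve close
# ranks: the standard error of a per-pair win rate shrinks only with the
# square root of the number of comparisons.
import math

def win_rate_margin(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error on a win rate p over n comparisons."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (5, 50, 500, 5000):
    print(f"n={n:>5}  observed win rate 0.55 ± {win_rate_margin(0.55, n):.3f}")
# At n=5 the margin is roughly ±0.44, so a 55%-vs-45% matchup is invisible;
# it only becomes resolvable in the hundreds-to-thousands range per pair.
```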

5. Do you report confidence intervals or uncertainty estimates?

A rank of four with overlapping confidence intervals against ranks three and five is genuinely a rank of three to five. A dashboard that reports point estimates without uncertainty is hiding the noise band inside a precise-looking number.

Confidence intervals on AI visibility ranks are not optional once sample-size variance is documented. The Fishkin and O'Donnell 2026 study showed that fewer than one in 1,000 repeated prompts produced the same list in the same order, which means any point-estimate rank that ignores variance is overstating its own precision (Fishkin and O'Donnell, 2026).

Right answer: Reported confidence intervals, error bars, or explicit uncertainty bands on every position estimate.

Red flag: Single integer ranks with no uncertainty bound, presented as authoritative.
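
One standard way to attach uncertainty to a share-type score is a Wilson 95% interval on the proportion of responses that mention the brand. The sketch below illustrates the principle; it is not a claim about how any specific vendor computes its bands.

```python
# One standard choice for the uncertainty band: a Wilson 95% interval on
# the proportion of responses that mention the brand. Illustrative only.
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

low, high = wilson_interval(successes=132, n=400)
print(f"mention rate 33.0% -> 95% CI [{low:.1%}, {high:.1%}]")
# Two brands whose intervals overlap should not be presented as holding
# distinct integer ranks.
```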

6. Can you reproduce the same brand's score within five percent on a re-run?

This is the replication test. Run the same measurement twice in the same week. A defensible method produces a score within a tight band on both runs. A method that produces wildly different numbers on identical inputs is reporting noise.

Right answer: Replication evidence on request, ideally a chart showing test-retest variance within a stated tolerance.

Red flag: No replication evidence, or replication variance exceeding the stated measurement-period change the vendor reports as meaningful.
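
The check itself is trivial to automate once the vendor can export two scores for the same brand measured in the same week. A minimal sketch, assuming a relative five percent tolerance:

```python
# Minimal test-retest check, assuming the vendor can export two scores for
# the same brand measured twice in the same week.
def replicates(run1: float, run2: float, tolerance: float = 0.05) -> bool:
    """True if the re-run lands within a relative tolerance of the first run."""
    return abs(run1 - run2) <= tolerance * max(run1, run2)

print(replicates(42.0, 43.5))  # True: within five percent
print(replicates(42.0, 51.0))  # False: the method is reporting noise
```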

7. How do you handle cross-engine disagreement?

ChatGPT, Claude, and Gemini disagree about which brand is strongest in most categories. A composite score has to make an editorial choice about how to handle that disagreement. The choice is defensible only when it is disclosed.

Right answer: The vendor reports per-engine scores in addition to the composite, and the dashboard shows where engines disagree. Buyers can see whether the brand is strong everywhere or strong on one engine and weak elsewhere.

Red flag: A single composite score with no per-engine view available.

8. How is the category defined, and who defined it?

Rankings change meaning when the category set changes. Add a new brand to the tracked list and existing brands shift positions for reasons unrelated to actual brand strength. The category-definition call is judgment work, not statistics (construct-validity).

Right answer: A documented category-definition rubric, a named human who made the call, and a process for revising it when the category evolves. See Bean, 2024 on construct validity in measurement.

Red flag: "We use AI to define the category." The category-definition call is exactly the place where automation off-loads strategic judgment onto an opaque process.

9. Is there a methodology changelog?

Methodologies evolve. A vendor that changes prompt templates, engine weights, or scoring formulas without a public changelog is producing month-over-month numbers that are not comparable. The change itself is fine. The lack of disclosure is the problem.

Right answer: A dated changelog accessible to clients, with version numbers attached to historical reports. Buyers can tell whether a rank change reflects brand movement or methodology movement.

Red flag: No changelog, or "we improve the model continuously" as the entire answer.
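
In practice, "a dated changelog with version numbers attached to historical reports" can be as lightweight as the structure sketched below. Field names are illustrative assumptions, not a required schema.

```python
# Sketch of a dated changelog with version numbers attached to historical
# reports. Field names are illustrative assumptions, not a required schema.
methodology_changelog = [
    {"version": "1.2", "date": "2026-03-01",
     "change": "Added a fourth engine to the panel; reweighted the composite."},
    {"version": "1.1", "date": "2026-01-15",
     "change": "Prompt templates rewritten to remove all brand names."},
]

monthly_report = {"period": "2026-03", "brand_rank": 4, "methodology_version": "1.2"}
# A rank change between reports carrying different methodology_version
# values cannot be read as brand movement without further analysis.
```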

10. Are you also a media network, an agency, or otherwise compensated to surface specific brands?

This is the conflict-of-interest question. A measurement vendor that also runs sponsored placements, paid AI citation programs, or affiliate deals has a structural incentive to report numbers that favor paying customers. The conflict can be managed, but only if it is disclosed.

Right answer: A written disclosure of every revenue stream, an explicit firewall between measurement and any commercial relationship with tracked brands, and a willingness to name which tracked brands are also customers of other vendor services.

Red flag: Evasion, or "we are vendor-neutral" without a structural firewall described.

11. What does the contract say about data ownership and portability?

If the contract ends, what can the buyer take with them? The raw measurement data should belong to the buyer, exportable in a standard format. Anything less locks the buyer into the vendor's interpretation layer.

Right answer: A clause granting the buyer full export rights to raw scan data (prompts run, responses received, scores assigned), in a portable format such as CSV or JSON, on request and on contract end.

Red flag: Data is "available through the dashboard" only, with no export path, or export limited to summary aggregates rather than raw measurements.
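
A useful test of an export clause is whether the raw records are rich enough to recompute the score without the vendor's dashboard. The record shape below is an assumption for illustration, not any vendor's actual schema.

```python
# Illustrative shape of one raw export record (field names are assumptions,
# not any vendor's actual schema): one record per prompt run, rich enough
# to recompute the score without the vendor's dashboard.
import json

record = {
    "run_id": "2026-03-14-000137",
    "engine": "chatgpt",
    "methodology_version": "1.2",
    "prompt": "We are a ten-person sales team choosing among ...",
    "response_text": "...",
    "brands_extracted": ["BrandA", "BrandB"],
    "score_contribution": 0.5,
}
print(json.dumps(record, indent=2))
```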

12. How does your own brand perform on the metric you sell?

This is the self-test. A vendor selling AI visibility measurement should be findable when buyers ask AI engines about AI visibility vendors. If the vendor's own brand does not appear in the recommendations its product is supposed to optimize, the buyer should ask why.

Asking five frontier AI engines a category-defining query ("what are the best AI visibility measurement tools for agencies?") and tracking which vendors surface in the responses is a free 15-minute due-diligence step. A vendor that scores zero on its own category-defining query is not necessarily a fraud, but the gap is worth a conversation (share-of-model).

Right answer: The vendor scores meaningfully on its own product category across multiple engines, or has a credible explanation for why it does not (recent launch, deliberate positioning, etc.).

Red flag: The vendor is invisible on its own category query and cannot explain why.
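
The self-test needs no API access: paste each engine's answer to the category-defining query into a small script and tally which vendors appear. Engine names, vendor names, and response text below are placeholders.

```python
# The 15-minute self-test, done by hand: paste each engine's answer to the
# category-defining query into `responses`, then tally where each vendor
# appears. Engine names, vendor names, and text are placeholders.
responses = {
    "chatgpt": "Agencies most often use VendorA and VendorB for this ...",
    "claude": "VendorB and VendorC are commonly cited ...",
    "gemini": "...",
    "perplexity": "...",
    "copilot": "...",
}
vendors = ["VendorA", "VendorB", "VendorC"]

appearances = {
    v: [engine for engine, text in responses.items() if v.lower() in text.lower()]
    for v in vendors
}
print(appearances)  # a vendor absent from every engine is worth a conversation
```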

13. Can the vendor produce customer references with actual numbers?

Testimonials are easy. Numbers are hard. A credible reference case shows the metric the customer started at, the actions taken, the metric the customer ended at, and the time elapsed. Anything less is a quote on a website.

A defensible customer case study reports a baseline measurement, a documented intervention (content investment, earned-media work, methodology change), a follow-up measurement, and the time between them. The pattern matches the pre-test/post-test design that has been the standard for evaluating educational and clinical interventions for over fifty years.

Right answer: Two or three customer references willing to take a 20-minute call, with specific before/after numbers documented in advance.

Red flag: "We respect customer confidentiality, so we cannot share specifics." Confidentiality is real, but a vendor with strong cases can always find at least one customer willing to be named with permission.


How to use this checklist in a real procurement

Send the 13 questions to every short-listed vendor at the same time, in writing, with a deadline. Score each answer on a three-point scale (full answer, partial answer, no answer or red flag). The vendor with the most full answers is not automatically the right pick, but the vendor with three or more red flags should not advance.
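
The scoring and the red-flag cutoff translate directly into a scorecard, sketched below with hypothetical vendors and grades (2 = full answer, 1 = partial answer, 0 = no answer or red flag).

```python
# Minimal scorecard, assuming each of the 13 answers has been graded
# 2 (full), 1 (partial), or 0 (no answer / red flag). Vendors and grades
# are hypothetical.
vendor_answers = {
    "VendorA": [2, 2, 1, 2, 0, 1, 2, 1, 2, 0, 2, 1, 2],
    "VendorB": [1, 0, 0, 1, 0, 2, 1, 1, 0, 1, 1, 0, 1],
}

for vendor, grades in vendor_answers.items():
    red_flags = grades.count(0)
    verdict = "advance" if red_flags < 3 else "do not advance"
    print(f"{vendor}: total {sum(grades)}, red flags {red_flags} -> {verdict}")
```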

Two operational notes. First, the questions are independent, so the order of asking does not matter. Second, the answers are negotiating documents. If a vendor produces a thin answer to question three (engine weighting) but a full answer to question four (sample size), that is a starting point for the conversation, not a disqualification. Real procurement separates "cannot answer" from "would prefer not to answer in a sales call." The first is a red flag. The second is a negotiation lever.

The pattern that emerges across most vendor evaluations is consistent. Two or three of the 13 items will produce strong answers across the entire short list. Two or three will produce weak answers across the entire short list. The remaining seven or eight are where vendors differentiate. Those middle items are the procurement decision.

For agencies running this evaluation on behalf of a client, the deliverable is a one-page scorecard summarizing the 13 items by vendor, with the source documents (vendor emails, methodology pages, sample prompts) linked in an appendix. That scorecard becomes the renewal-defense document twelve months later when the client asks why the agency picked this vendor over the others. See our methodology audit walkthrough for the operational version of this process.


Frequently asked questions

What is AEO vendor due diligence?

AEO vendor due diligence is the procurement process of evaluating an AI visibility measurement tool against a written methodology standard before signing or renewing. It treats AI visibility measurement as a measurement-grade procurement category, not a marketing-tool category, and asks the vendor to disclose how the numbers are produced. The 13-item checklist above is one operational form of this process.

How long does this evaluation take?

Two to four weeks elapsed time for most short lists. The vendor side takes about a week to produce written answers. The buyer side takes another week to score the answers and run reference calls. The remaining time is reading the methodology pages and drafting the scorecard. Most of the work is asynchronous on the vendor side, so the buyer's active hours are typically eight to twelve.

What if no vendor passes all 13 items?

That is the common case as of mid-2026. The category is young, and most vendors do not yet publish full methodology documentation. The right move is to weight the items by importance (questions one, two, three, five, and ten matter more than seven and nine for most use cases) and pick the vendor with the strongest answers on the weighted top five. Re-evaluate every twelve months because the category is maturing fast. Our position on AEO critics frames the maturity gap and what to expect next.

Should we run this checklist on incumbent vendors at renewal?

Yes, and the answer is often more informative than running it on new vendors. An incumbent that has been comfortable with a thin methodology disclosure for twelve months is a renewal risk once the buyer's procurement standards rise. The renewal conversation is the right moment to upgrade the standard for both sides.

Can we share this checklist with vendors directly?

Yes. Several agencies and CMOs are already sending this list verbatim to vendors as part of RFP responses. The checklist is most useful when the vendor knows in advance what questions are coming, because the vendor can produce thorough written answers rather than improvising on a sales call. Transparency on the buyer side produces better answers from the vendor side.

How is this different from a SOC 2 or security review?

A SOC 2 or security review covers how the vendor handles data. The 13-item checklist covers how the vendor produces measurements. Both reviews are necessary. Buyers running an AI visibility vendor evaluation should add the security review to the procurement track in parallel with the methodology track. The two reviews do not overlap, and the vendor team that handles one is typically not the team that handles the other.



Run the checklist on us first

The 13 items above apply to GenPicked as much as to any other vendor in the category. Send the list. We will answer in writing within seven days, with sample prompts, engine weights, methodology changelog, and replication evidence attached.

Run a free GenPicked AEO audit on your brand before you commit to any vendor, including this one.

Start your 14-day free trial of GenPicked Growth


Dr. William L. Banks III is Founder of GenPicked. The procurement checklist above derives from twelve months of vendor evaluations conducted on behalf of agency and in-house buyers, cross-referenced against the construct-validity literature (Bean, 2024), the LLM sycophancy literature (Atwell and Alikhani, 2025), and the AI brand inconsistency study (Fishkin and O'Donnell, 2026). Specific citations available on request.

Dr. William L. Banks III

Co-Founder, GenPicked
