Where AI Gets Its Information

In this article, you will learn the two sources AI models draw on when they answer questions (training data and live retrieval), how the two differ, why the distinction decides what shows up in answers, and what it means for your brand visibility strategy.

Where you are in the curriculum

This is Lesson 2.2 of Module 2 in the AEO A to Z course. Lesson 2.1 covered the mechanics of how an LLM produces text: tokens, next-token prediction, temperature. This lesson zooms out one level. Once the machine is ready to answer, where does it pull the content of that answer from?

The answer is: two very different places. If you conflate them, you will waste budget.


The two-source model

Every answer an AI system gives you is assembled from some mixture of two sources:

  1. Training data: Everything baked into the model's weights during training. Fixed on the day training stopped. Does not update when the world changes.
  2. Retrieval: Content the system fetches live from the web (or a private document store) at the moment you ask the question. Fresh. Changes with the index.

A pure LLM, the kind you'd run locally with no internet, has only source #1. A product like Perplexity or ChatGPT Search combines both. Google AI Mode and AI Overviews lean heavily on retrieval. Claude.ai with web search turned off leans heavily on training data.

Most of the AI search experiences you care about for AEO are hybrid. They do both. Knowing which source is which determines the levers you can pull.

Source 1: Training data, what's baked in

Training data is the enormous corpus of text a model reads during training. Publicly available web pages, books, code, reference material, forum archives. The model doesn't store this text verbatim. It compresses the patterns in the text into billions of numerical weights.

Think of training data like the education a new hire brought with them on their first day. They learned it somewhere else, before they joined your company. You can't edit what's in their head. You can only ask them questions and see what comes out.

Three important consequences for AEO:

  1. Training data has a cutoff date. The model doesn't know what happened after that date, unless retrieval fills in the gap.
  2. You cannot directly change training data. It was finalized before you ever heard of the model. The next training run might include more recent content, but you don't control the schedule.
  3. Frequency matters. Brands and facts that appeared often in training data are easier for the model to produce. Rare mentions get lost.

This last point is the mechanism behind popularity bias, one of the three biases GenPicked Academy teaches you to control for and a concept we'll cover in depth in Module 3. For now, hold this: if your brand was mentioned 50 times in training data and a competitor was mentioned 5,000 times, the competitor will come up more often in un-prompted answers, all else equal.

AEO claim-evidence block. Training data is fixed at training time and drawn from a documentable web snapshot, not "the internet" in the abstract (Soldaini et al., 2024); the model's "memory" of a brand reflects how often that brand appeared in the pre-training corpus. Popularity bias is measurable: Deldjoo et al. (2024) documented at least four simultaneous biases in ChatGPT recommendations, with popularity being one of the strongest. See AI recommendation consistency.

Source 2: Retrieval, what's fetched live

Retrieval-augmented generation, or RAG, is the mechanism that lets AI products answer about things that happened yesterday. The system takes your question, runs a search behind the scenes, grabs a handful of relevant pages, and feeds those pages into the model's context window alongside your question. The model reads the fetched pages and generates an answer that blends what it already knew with what it just read.
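The mechanism can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: `search_web` and `call_model` are hypothetical stand-ins for whatever search index and LLM endpoint a production system actually uses.

```python
def answer_with_rag(question, search_web, call_model, k=5):
    """Fetch up to k relevant pages, then generate an answer grounded in them."""
    # 1. Run a search behind the scenes.
    pages = search_web(question)[:k]
    # 2. Stuff the fetched pages into the model's context window
    #    alongside the question.
    context = "\n\n".join(f"Source {i + 1}: {p}" for i, p in enumerate(pages))
    prompt = (
        "Answer using the sources below where relevant.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    # 3. The model blends what it already knew (its weights) with
    #    what it just read (the fetched context).
    return call_model(prompt)

# Usage with stub functions standing in for a real retriever and model:
fake_search = lambda q: ["Apple reported quarterly revenue of ...", "Analysts noted ..."]
fake_model = lambda prompt: prompt  # a real model would generate text here
print(answer_with_rag("What were Apple's earnings last quarter?", fake_search, fake_model))
```

The point of the sketch is the shape of the pipeline: retrieval changes only what lands in the context window; the model itself is unchanged.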

Think of retrieval like letting the same new-hire employee Google the answer in real time before responding. They bring their old education plus whatever they just found. The quality of their answer depends on both.

Retrieval is where AEO gets actionable. You can influence what gets retrieved because you can influence what exists on the public web. The levers:

  1. Publish content that is likely to be retrieved for your target questions.
  2. Earn coverage in third-party publications that rank well for those questions.
  3. Structure content so retrievers can pull a clean, quotable chunk (headings, direct claims, named entities).
  4. Keep content fresh: retrieval systems favor recent, indexed material.

The earned media bias, why this matters a lot

Here's the finding that changes most AEO strategies. The University of Toronto (2025) studied AI citations across platforms and found that 82 to 89 percent of AI citations come from earned media (third-party articles, reviews, analyst coverage) rather than from brands' own websites. In the US, 92.1 percent of cited content was earned. Muck Rack (2025) reached the same conclusion in a parallel industry study: generative AI engines cite journalism and earned media at rates far above their share of the open web. Brand-owned marketing pages are cited rarely, if at all.

This is counter-intuitive if you came from SEO, where optimizing your own site is the main game. In AEO, your own site is mostly a supporting asset. The citation layer, the pool of URLs AI systems actually quote, is dominated by what other people wrote about you.

AEO claim-evidence block. 82 to 89 percent of AI citations come from earned media, not brand-owned content (U. Toronto 2025; Muck Rack 2025). Lorphic (2026) separately found that 86 percent of brand data is brand-managed, yet AI pulls predominantly from unmanaged third-party sources, creating a structural gap between what brands control and what AI cites. See earned media bias.

So: if you spend all of your AEO budget rewriting your own website's copy and none of it on earning coverage, you are optimizing the wrong layer.

How training and retrieval combine in an answer

When you ask a hybrid system a question, this is roughly what happens:

  1. The system decides whether to retrieve. (For some questions, like "What's the capital of France?", it doesn't bother. For others, like "What were Apple's earnings last quarter?", it will.)
  2. If retrieval runs, it fetches a few documents and stuffs them into the model's context window.
  3. The model generates an answer using both the fetched documents and its training-data knowledge.
  4. The answer shows up on your screen, often with inline citations for the retrieved sources.
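The decision step is worth making concrete. The sketch below uses a keyword heuristic as a hypothetical stand-in; real systems use a trained classifier or ask the model itself whether the question needs fresh information.

```python
# Cues suggesting the question is time-sensitive (illustrative list only).
FRESHNESS_CUES = ("last quarter", "latest", "today", "this year", "news")

def needs_retrieval(question: str) -> bool:
    """Crude stand-in for step 1: does this question need fresh data?"""
    q = question.lower()
    return any(cue in q for cue in FRESHNESS_CUES)

def answer(question, retrieve, generate):
    # Step 2: fetch documents only when the question looks time-sensitive.
    docs = retrieve(question) if needs_retrieval(question) else []
    # Step 3: the model always draws on its training-data knowledge;
    # retrieved docs are added to the context only when present.
    return generate(question, docs)

print(needs_retrieval("What's the capital of France?"))             # False
print(needs_retrieval("What were Apple's earnings last quarter?"))  # True
```

Notice the asymmetry: training-data knowledge is always in play, while retrieval is conditional. That is why stale or missing training-data coverage still hurts you even on platforms that retrieve.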

Where your brand can appear in that answer has two different requirements:

  • To appear from training data: your brand needs to have been mentioned enough times, in enough places, in the data the model was trained on, for the pattern to be learnable. This is slow to change.
  • To appear from retrieval: your brand needs to show up on pages the retrieval system selects for the question. This is faster to change, but still depends on earned coverage more than on your own content.

This is why "AEO strategy" isn't one thing. It's at least two different games: the long game of ending up in future training data, and the shorter game of showing up in what gets retrieved today.

Try this

Pick a well-known brand. Open Perplexity (which always retrieves) and ask: "What do people say about [brand]?" Watch the citations it lists. Count how many are the brand's own domain versus third-party sites. Then ask the same question on ChatGPT with web search off (training-data only). Compare.

The Perplexity answer will lean on earned media. The training-data-only answer will lean on the patterns baked into the model. Same brand, two different pictures. That's the two-source model in action, on a live tool, in under five minutes.
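If you want to run the counting step at scale, the owned-versus-earned split is a one-liner over the cited URLs. The URLs below are hypothetical examples for a made-up brand; paste in whatever citations the tool actually showed you.

```python
from urllib.parse import urlparse

def citation_split(cited_urls, brand_domain):
    """Count citations on the brand's own domain vs. third-party (earned) sites."""
    owned = sum(1 for u in cited_urls if urlparse(u).netloc.endswith(brand_domain))
    return owned, len(cited_urls) - owned

# Hypothetical citation list for illustration:
urls = [
    "https://techcrunch.com/review-of-acme",
    "https://www.acme.com/about",
    "https://nytimes.com/acme-profile",
]
owned, earned = citation_split(urls, "acme.com")
print(f"owned: {owned}, earned: {earned}")  # owned: 1, earned: 2
```

If the earned-media studies above hold for your brand, expect the earned count to dominate on any retrieval-backed platform.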

Key takeaways

  1. AI answers come from two sources: training data (baked in) and retrieval (fetched live). They behave differently and respond to different interventions.
  2. Training data is closed to direct influence. Retrieval is open to influence through published content and earned coverage.
  3. The citation layer is dominated by earned media (82 to 89 percent of AI citations). If your strategy is all owned content, you are targeting the wrong pool.

What's next

In Lesson 2.3, Why AI Answers Change Every Time You Ask, we'll go deep on why the same question produces different outputs. You'll learn the difference between surface volatility (the words change) and semantic stability (the meaning holds), and why that distinction decides whether an AEO measurement tool is producing signal or noise.

After Module 2 closes, Module 3 opens with Sycophancy, the first bias you need to recognize before you can measure AEO honestly.

Reflection prompt

Answer this in your own words: If your CEO asked you "which of our AEO efforts affects training data and which affects retrieval?", what would you say about a website rewrite, a podcast appearance, and a TechCrunch feature?

Write a three-sentence answer. If you can sort those three activities by source, you've got Lesson 2.2.


About this course

This lesson is part of AEO A to Z, the open course on Answer Engine Optimization published by GenPicked Academy. GenPicked Academy is where practitioners learn to measure AI recommendations with the same rigor a clinical trial demands: blind sampling, balanced question sets, and confidence intervals that hold up.

About the author: Dr. William L. Banks III is the lead researcher at GenPicked Academy and the architect of the three-layer AEO measurement architecture taught in this course. His work on sycophancy, popularity bias, and construct validity in AI search informs every lesson you just read.

See the methods in practice: GenPicked runs monthly brand-intelligence audits using the exact pipeline taught in Module 6. Read the case studies and audit walkthroughs on the GenPicked blog.

Knowledge check · ungraded

Check your understanding before moving on

1. Which of these is NOT one of the canonical sources AI engines pull from?

  • Pretraining corpus
  • Real-time web retrieval
  • Fine-tuning preference data
  • Brand-supplied push notifications