How LLMs Generate Answers

In this article, you will learn:

  • What a large language model actually does when it answers a question
  • What tokens are
  • What "next-token prediction" means
  • Why the same prompt can produce different answers on different days

Where you are in the curriculum

This is the first lesson in Module 2 of the AEO A to Z course. Module 1 covered what AEO is and why search is shifting. Module 2 is the "under the hood" module. If you want to do AEO work, measure it, or sell it, you need a working mental model of what the machine is doing. That's what this lesson from GenPicked Academy gives you.

No heavy math. No code you have to run. Just analogies that hold up, plus a few optional code sketches for readers who want to peek under the hood.


The one-sentence version

A large language model (LLM) is a program that has read an enormous amount of text and learned to guess what word comes next. That's it. Everything else (the fluency, the confidence, the eerie usefulness) is an emergent property of that one trick, repeated billions of times at scale.

If you remember nothing else from this lesson, remember: the model is guessing the next word. Every time.

Step 1, Tokens: the alphabet of the machine

Humans read in words. LLMs read in tokens. A token is a chunk of text, often a whole word, sometimes a piece of a word, sometimes a punctuation mark. The word "strawberry" might be three tokens. The word "the" is one. The word "unfortunately" might split into "un," "fortunate," and "ly."

Before the model can think about your question, your question is chopped into tokens. The sentence "What are the best CRMs?" becomes something like ["What", " are", " the", " best", " CR", "Ms", "?"]. That list of tokens is what the model actually sees.

Why does this matter for AEO? Because brand names get tokenized too. "Salesforce" might be one token if it appears often in training data. A newer brand like "GenPicked" might split into "Gen" and "Picked." How a brand is tokenized affects how easily the model can retrieve and mention it. You don't need to control this, but it's useful to know it's happening.
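If you want to see tokenization with your own eyes, here is a minimal sketch using OpenAI's open-source tiktoken tokenizer (our choice for illustration; any tokenizer works, and the exact splits vary by model):

```python
# Minimal tokenization sketch, assuming the open-source `tiktoken`
# package (pip install tiktoken). Splits vary by model; cl100k_base
# is used purely as an example encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["What are the best CRMs?", "Salesforce", "GenPicked"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r} -> {pieces}")
```

Run it on your own brand name and see how many pieces it shatters into.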

Step 2, Next-token prediction: the only trick

Here is the whole algorithm, in one sentence: given the tokens so far, the model assigns a probability to every possible next token in its vocabulary, then picks one.

That's it. The model looks at the tokens in front of it, and it ranks every word it knows by how likely that word is to come next. Then it picks one. Then it adds that word to the list and does the whole thing again. And again. Until it decides to stop.

Think of it like autocomplete on your phone, but trained on roughly the entire public internet, and running at a scale no phone could handle. Your phone's autocomplete might suggest three likely next words. An LLM is doing the same thing, ranking every token in a vocabulary of roughly 50,000 entries, and doing it for every single token it outputs.
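Here is that loop as a toy sketch. The "model" below is a stub we invented that returns made-up probabilities; a real LLM computes this distribution from billions of learned weights, but the loop around it has the same shape:

```python
# Toy sketch of the generation loop. `toy_model` is a stand-in that
# returns a hypothetical next-token distribution instead of computing
# one from learned weights.
import random

def toy_model(tokens):
    # In a real LLM this distribution depends on every token so far.
    return {" Salesforce": 0.35, " HubSpot": 0.30, " Pipedrive": 0.20, ".": 0.15}

tokens = ["A", " popular", " CRM", " is"]
for _ in range(10):                        # safety cap on length
    dist = toy_model(tokens)
    choices, weights = zip(*dist.items())
    next_token = random.choices(choices, weights=weights)[0]  # weighted dice
    tokens.append(next_token)
    if next_token == ".":                  # the model "decides to stop"
        break

print("".join(tokens))
```

Run it a few times: the sentence ends with a different brand on different runs. That is the whole phenomenon this lesson is about, in miniature.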

AEO claim-evidence block. LLM outputs are produced one token at a time through probabilistic next-token prediction. Because the model samples from a probability distribution rather than deterministically selecting the single highest-scoring token, identical prompts almost never produce identical responses: Fishkin and O'Donnell (2026) ran 2,961 identical prompts through three models and found that under 1% returned the same list of brands. See "AI recommendation consistency."

Step 3, Training: how the model learned to guess

The model didn't come out of the box knowing which word to predict. It learned from a specific, documentable web snapshot, not "the internet" in the abstract (Soldaini et al. 2024). Here's the analogy.

Imagine giving a student a library of 10 trillion words and telling them one thing: "For every sentence in this library, cover up the next word and try to guess it from the words that came before. If you're wrong, adjust your reasoning. Do this until you're good at it." Now run that for several months on thousands of GPUs. That's training.

The model reads. It guesses. It checks its guess against the actual word. It nudges its internal parameters (billions of numbers called weights) to make better guesses next time. After enough passes through enough text, the weights encode a staggering amount of pattern-recognition about language, facts, and how humans talk about the world.
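As a toy illustration of that guess-check-nudge cycle, here is a sketch in which the "weights" are just a table of counts nudged toward the words that actually follow. Real training adjusts billions of continuous weights with gradient descent, but the feedback loop has the same shape:

```python
# Toy sketch of next-word training. The "weights" are a count table
# nudged toward observed next words; real LLMs adjust billions of
# continuous weights via gradient descent instead.
from collections import defaultdict

weights = defaultdict(lambda: defaultdict(int))

corpus = "the model reads the text and the model guesses the next word".split()
for prev_word, actual_next in zip(corpus, corpus[1:]):
    weights[prev_word][actual_next] += 1   # nudge toward the real answer

def predict_next(word):
    followers = weights[word]
    return max(followers, key=followers.get) if followers else None

print(predict_next("the"))   # -> "model", seen most often after "the"
```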

After the first training pass, there's a second one. This one uses human feedback. Human raters look at pairs of responses and say which one is better. The model learns to produce outputs humans rate highly. This is called Reinforcement Learning from Human Feedback (RLHF), and it's where a lot of the "helpful" personality of ChatGPT, Claude, and Gemini comes from. (Sharma and colleagues at Anthropic published the foundational paper on what happens when RLHF goes too far: the model starts preferring "agreeable" over "accurate." Perez et al. (2023) independently documented the same drift in model-written evaluations. That's sycophancy, covered in Module 3.)

Step 4, Training data vs. retrieval: two ways the model knows things

When the model answers you, some of what it says is baked in from training. Some of it is fetched at the moment you ask. These are two very different things, and they have very different implications for AEO.

  • Baked in: Every fact, phrase, and association in the model's training data, everything the model "remembers." This was fixed on the day training stopped. It does not update when the world changes.
  • Fetched live: When a system like ChatGPT Search, Perplexity, or Google AI Mode answers you, it often runs a web search during your question, grabs a few relevant pages, and uses those pages to inform its answer. This is called retrieval-augmented generation, or RAG.
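A minimal sketch of the contrast, using two stand-ins we made up for illustration (neither is any real product's API): search() plays the live web search and llm() plays the model.

```python
# Sketch contrasting "baked in" vs. "fetched live." Both helpers are
# hypothetical stand-ins, not a real product's API.

def search(question):
    # Stand-in for a live web search run at question time.
    return ["CRM stands for customer relationship management.",
            "Popular CRMs include Salesforce and HubSpot."]

def llm(prompt):
    # Stand-in for the model; a real LLM samples tokens here.
    return f"(answer conditioned on a {len(prompt)}-character prompt)"

def answer_from_weights(question):
    return llm(question)                   # only what training baked in

def answer_with_rag(question):
    context = "\n".join(search(question))  # fetched live
    return llm(f"Sources:\n{context}\n\nQuestion: {question}")

print(answer_from_weights("What are the best CRMs?"))
print(answer_with_rag("What are the best CRMs?"))
```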

Lesson 2.2 goes deep on this distinction. For now, just hold onto the idea that "the AI knows it" could mean two very different things.

Step 5, Why the output is probabilistic

Remember: at every step, the model has a list of possible next tokens with probabilities attached. The highest-probability token isn't always the one that gets picked.

There's a parameter called temperature that controls how random the pick is. Low temperature (near 0) makes the model pick the most likely token almost every time: deterministic, repetitive, safe. High temperature (near 1 or above) makes the model sample more freely: creative, varied, sometimes weird. Most consumer AI products run somewhere in the middle.

This is why the same prompt can give you different answers. It's not that the model "changed its mind." It's that at each fork in the road, the model rolled weighted dice. Different rolls, different paths, different sentences. Lesson 2.3 goes deep on what this means for measurement.
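Here is a minimal sketch of temperature at work, using three made-up scores ("logits") for illustration. Dividing the raw scores by the temperature before converting them to probabilities is, in essence, what real samplers do:

```python
# Minimal temperature-sampling sketch with made-up scores.
# Lower temperature sharpens the distribution; higher flattens it.
import math
import random

def sample(logits, temperature):
    scaled = {tok: score / temperature for tok, score in logits.items()}
    total = sum(math.exp(s) for s in scaled.values())
    probs = {tok: math.exp(s) / total for tok, s in scaled.items()}  # softmax
    return random.choices(list(probs), weights=list(probs.values()))[0]

logits = {"Salesforce": 2.0, "HubSpot": 1.5, "Pipedrive": 0.5}

for t in (0.1, 1.0):
    picks = [sample(logits, t) for _ in range(8)]
    print(f"temperature={t}: {picks}")
```

At temperature 0.1 nearly every pick is "Salesforce"; at 1.0 the other brands show up regularly. Same scores, different dice.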

AEO claim-evidence block. LLMs produce different surface outputs for identical prompts because sampling introduces randomness at every token, a pattern Alexander (2026) independently documented across repeated calls, even at temperature zero. But the underlying semantic content is far more stable: across 730,000 query pairs, Gavoyannis and Ahrefs (2025) found 86% semantic similarity between AI Mode and AI Overviews despite only 13.7% citation overlap. See "semantic stability vs. surface volatility."

Try this

Open ChatGPT. Ask: "Write a one-sentence description of what a CRM is." Copy the response. Open a new chat. Ask the exact same question. Compare the two sentences.

They won't be identical. They'll mean roughly the same thing, but the words, order, and emphasis will shift. You just watched probabilistic sampling in action. The model wasn't "thinking differently." It was rolling the same dice and getting different numbers.
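If you'd rather run the experiment from code, here is a sketch assuming the official openai Python package and an OPENAI_API_KEY in your environment; the model name is illustrative, so substitute whatever you have access to:

```python
# Sketch of the same two-run experiment via the API. Assumes
# `pip install openai` and an OPENAI_API_KEY environment variable;
# the model name is illustrative.
from openai import OpenAI

client = OpenAI()
question = "Write a one-sentence description of what a CRM is."

for run in (1, 2):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    print(f"Run {run}: {resp.choices[0].message.content}")
```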

Now imagine an AEO tool that asks "which CRM should I buy?" five times and reports the top three brands by mention frequency. If you understand next-token prediction, you understand why that report will change depending on when the tool ran.

Key takeaways

  1. LLMs generate answers by predicting the next token, one at a time, from a probability distribution. That single mechanism, repeated at scale, produces everything you see.
  2. Training bakes patterns into the weights. Retrieval (RAG) fetches fresh content at query time. Both shape the answer; they are not the same thing.
  3. Outputs are probabilistic, not deterministic. The same prompt produces different surface text. Semantic meaning is more stable than the words.

What's next

In the next lesson, Where AI Gets Its Information, we'll go deep on the training-data-vs-retrieval distinction. You'll learn what's baked in, what's fetched, and why this split decides which AEO strategies actually work.

After that, Lesson 2.3 covers why the same question gives you different answers, and Lesson 3.1 opens the bias module with sycophancy, the first of the four biases that distort AI brand recommendations.

Reflection prompt

Before moving on, answer this in your own words: If an AEO tool reports your brand got mentioned in "62% of AI responses," what are the two mechanisms, from this lesson, that could make that number mean something different each time the tool is run?

Write it down. One or two sentences is enough. If you can answer that, you've got Lesson 2.1.


About this course

This lesson is part of AEO A to Z, the open course on Answer Engine Optimization published by GenPicked Academy. GenPicked Academy is where practitioners learn to measure AI recommendations with the same rigor a clinical trial demands: blind sampling, balanced question sets, and confidence intervals that hold up.

About the author: Dr. William L. Banks III is the lead researcher at GenPicked Academy and the architect of the three-layer AEO measurement architecture taught in this course. His work on sycophancy, popularity bias, and construct validity in AI search informs every lesson you just read.

See the methods in practice: GenPicked runs monthly brand-intelligence audits using the exact pipeline taught in Module 6. Read the case studies and audit walkthroughs on the GenPicked blog.

Knowledge check · ungraded

Check your understanding before moving on

1. When an LLM "generates" a recommendation, it is best described as:

  • Looking up the brand in a static ranked database
  • Sampling tokens from a probability distribution conditioned on training data and retrieval context
  • Querying Google in real time and reformatting the results
  • Reading the brand's schema markup directly

2. Two things must both happen for a brand to appear in a generative answer:

  • It must have a Google ranking AND a verified social profile
  • It must be retrievable in context AND be probable enough to be sampled
  • It must pay for placement AND be a Fortune 500 brand
  • It must run AEO software AND have schema markup