Why AI Answers Change Every Time You Ask

In this article, you will learn:

  • Why the same question to ChatGPT, Claude, or Gemini produces different answers on different runs
  • What temperature and sampling actually mean
  • The difference between surface volatility and semantic stability
  • What all of this means for measuring brand visibility

Where you are in the curriculum

This is Lesson 2.3, the final lesson in Module 2 of the AEO A to Z course. Lesson 2.1 covered how LLMs generate text one token at a time. Lesson 2.2 covered where the content in an answer comes from. This lesson closes the module by answering a question every AEO practitioner has asked: why won't the machine give me the same answer twice?

The answer has two layers. One is a knob called temperature. The other is a surprising pattern in the data called semantic stability.


The dice at every fork

Remember the one-sentence version from Lesson 2.1: the model predicts the next token by assigning a probability to every possible word and then picking one. The question is, how does it pick?

It rolls weighted dice. Not uniform dice: dice weighted by the probabilities the model just computed. A high-probability word is more likely to be picked. A low-probability word is less likely, but not impossible. This is called sampling, and it's the reason AI answers aren't deterministic.

The everyday analogy: imagine ordering coffee at a café where the barista flips a loaded coin for every drink. Most of the time the coin comes up "your usual," but sometimes it comes up "something close." The drink they hand you will be roughly what you expected, but the exact one changes. That's sampling.
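
Here is the loaded coin as a minimal code sketch. The vocabulary and probabilities are invented for illustration; a real model scores tens of thousands of tokens at every step:

```python
import random

# Toy next-token distribution after a prompt like "The best CRM is".
# These probabilities are invented; real models weigh a vocabulary of
# tens of thousands of tokens at every step.
next_token_probs = {
    "Salesforce": 0.45,
    "HubSpot": 0.30,
    "Pipedrive": 0.15,
    "Zoho": 0.08,
    "a spreadsheet": 0.02,
}

tokens = list(next_token_probs)
weights = list(next_token_probs.values())

# Five rolls of the same weighted dice: same distribution, varying picks.
for run in range(1, 6):
    pick = random.choices(tokens, weights=weights, k=1)[0]
    print(f"run {run}: {pick}")
```

Run it twice and the picks will usually differ, even though the distribution never changed. That is the whole phenomenon in a dozen lines.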

The temperature knob

The knob that controls how much the dice wobble is called temperature. It's a single number that the engineers running the model set before generation begins. Three settings tell the story, and a code sketch after the list shows the effect in numbers:

  • Temperature near 0: The dice are heavily biased toward the highest-probability token. The model plays it safe. Outputs become repetitive, conservative, and nearly deterministic. Same prompt, same answer most of the time.
  • Temperature near 1: The dice roll more honestly. The model is willing to pick less-likely tokens. Outputs become varied, creative, sometimes surprising.
  • Temperature above 1: The probabilities flatten out, handing extra weight to less-likely words. Outputs get weird, sometimes incoherent.
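
A minimal sketch of that knob, using the standard softmax-with-temperature formula. The raw scores (logits) below are invented; the scaling itself is how temperature works:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw model scores into probabilities, scaled by temperature."""
    scaled = [score / temperature for score in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Invented raw scores for three candidate tokens.
logits = [4.0, 2.0, 0.5]
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature {t}: {[round(p, 3) for p in probs]}")
```

At 0.1 the top token takes nearly all the probability mass (near-deterministic); at 2.0 the three tokens drift toward an even split (weird territory).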

Most consumer AI products run temperature somewhere between 0.3 and 0.9. They don't show you the number. You don't control it. The result is that the same question gives you slightly, sometimes meaningfully, different answers.

What this looks like in real data

Rand Fishkin and Philip O'Donnell at SparkToro ran the most important study we have on this. Their team ran 12 identical prompts through ChatGPT, Claude, and Google AI, 2,961 times total. They looked at how often the exact same prompt produced the exact same list of brands.

The answer: fewer than 1 in 100 runs produced the same list of brands. Fewer than 1 in 1,000 produced the same list in the same order.

Let that sink in. If you run the same AEO query 100 times, you should expect close to 100 different brand lists. If your measurement tool runs the prompt once, the result it reports is one snapshot of a cloud of possible outputs.
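
If you want to see this variance on your own prompts, a simple first measurement is exact-match rate and pairwise overlap across repeated runs. A sketch, assuming you have already extracted one brand list per run; the sample lists here are fabricated:

```python
from itertools import combinations

def jaccard(a, b):
    """Set overlap between two brand lists: 1.0 = identical, 0.0 = disjoint."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Fabricated brand lists from five runs of the same prompt.
runs = [
    ["HubSpot", "Salesforce", "Pipedrive"],
    ["Salesforce", "HubSpot", "Zoho"],
    ["HubSpot", "Pipedrive", "Monday"],
    ["Salesforce", "HubSpot", "Pipedrive"],
    ["Zoho", "HubSpot", "Salesforce"],
]

exact = sum(sorted(a) == sorted(b) for a, b in combinations(runs, 2))
overlaps = [jaccard(a, b) for a, b in combinations(runs, 2)]
print(f"exact same set: {exact} of {len(overlaps)} pairs")
print(f"mean pairwise overlap: {sum(overlaps) / len(overlaps):.2f}")
```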

SE Ranking (2025) pushed this further with URL-level analysis. Across three runs of identical queries on the same day in Google AI Mode, they found only 9.2 percent URL overlap. For 21.2 percent of keywords, the three runs shared zero URLs. Not "few": zero. Alexander (2026) reached the same finding across models and temperatures: identical prompts yield non-identical outputs even at temperature zero, because of routing and other nondeterminism inside the serving stack.

AEO claim-evidence block. LLM outputs are probabilistic, not deterministic. Fishkin and O'Donnell (2026) ran 2,961 identical prompts through three AI models and found fewer than 1 in 100 produced the same list of brands, and fewer than 1 in 1,000 produced the same list in the same order. Sclar et al. (2024) independently showed that even trivial prompt-formatting changes shift LLM benchmark scores by more than the gap between published model versions. This is the baseline variance any AEO measurement must account for. See ai recommendation consistency.

The surprise: semantic stability

If the surface changes that much, you'd expect the meaning to change with it. It doesn't. This is the surprise, and it's the most useful finding in AEO measurement.

Xibeijia Gavoyannis and the Ahrefs team ran a study comparing AI Mode and AI Overviews across 730,000 query pairs. Surface-level agreement was low: 13.7 percent citation overlap, 16 percent word overlap, and only 2.5 percent of responses starting with the exact same sentence. But semantic similarity was 86 percent. The two systems agreed on what to say. They disagreed on how to say it and which sources to cite.

Ahrefs ran the same kind of analysis over time and found AI Overviews change their citations every two days, but their semantic agreement stays at 0.95. The systems "never change their mind." They just keep re-wording the same opinion.

This is called semantic stability versus surface volatility, and it's the most important measurement concept in AEO. Two levels (see the sketch after this list for how to measure each):

  • Surface level: the exact words, the specific URLs cited, the precise order of brands. Highly volatile. Changes on every run.
  • Semantic level: the underlying opinion of the model. What brands it thinks are best. What facts it thinks are true. Highly stable. Changes slowly.
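
Here is what measuring the two levels can look like in code. Word overlap stands in for the surface layer; embedding cosine similarity stands in for the semantic layer. This sketch assumes the sentence-transformers package and one of its published models (all-MiniLM-L6-v2); any embedding model would serve:

```python
from sentence_transformers import SentenceTransformer, util

def word_overlap(a, b):
    """Surface layer: Jaccard overlap of the exact words used."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

answer_1 = "HubSpot leads the pack for small teams thanks to its free tier."
answer_2 = "For smaller companies, HubSpot is the standout choice; the free plan helps."

surface = word_overlap(answer_1, answer_2)

# Semantic layer: embed both answers and compare directions in vector space.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([answer_1, answer_2])
semantic = util.cos_sim(embeddings[0], embeddings[1]).item()

print(f"surface (word) overlap: {surface:.2f}")  # typically low
print(f"semantic similarity:    {semantic:.2f}")  # typically much higher
```

The two answers share almost no words, yet their embeddings point the same way. That gap is the Ahrefs finding in miniature.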

AEO claim-evidence block. Surface volatility masks semantic stability. Gavoyannis and Ahrefs (2025) found 86 percent semantic similarity between AI Mode and AI Overviews across 730,000 query pairs despite only 13.7 percent citation overlap. Ahrefs (2025) separately observed 0.95 semantic similarity over time despite 54.5 percent URL overlap between consecutive runs. See semantic stability vs surface volatility.

Why this matters for measurement

If you're measuring the wrong level, your tool is reporting noise.

An AEO dashboard that tracks "which URLs got cited this week versus last week" is tracking the volatile surface layer. It will report constant change even when nothing underlying has moved. The CMO will see a red arrow and ask what went wrong. The answer is: nothing. The dice just rolled differently.

An AEO dashboard that tracks semantic signals (is the model describing your brand with the same attributes, the same tone, the same competitive framing?) is tracking the stable layer. Those numbers move more slowly, but when they move, something real happened.

Most current AEO tools track the surface layer because it's easier to measure. Counts of URLs and brand mentions are trivial to extract. Semantic tracking requires more sophisticated analysis. The consequence: many dashboards produce volatile numbers that don't mean what they appear to mean.

The second kind of change: systematic bias

There's a second reason answers change that has nothing to do with sampling. The same AI model will give you a different semantic answer if the prompt changes in specific ways, even when it looks identical to you.

Example: "What are the best CRMs?" and "What are the best CRMs like Salesforce?" are two different prompts. The second one anchors the answer to Salesforce. The model is now much more likely to include Salesforce in its response, not because Salesforce is actually the best, but because the prompt primed it to think about Salesforce.

That's sycophancy, a core concept GenPicked Academy teaches. It's a systematic bias, not a random roll of the dice. It's the subject of Module 3, Lesson 3.1. For now: know that the two reasons AI answers differ, random sampling and systematic bias, require two different measurement strategies. Mixing them up is how AEO tools end up reporting numbers that sound impressive but mean nothing.
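
The two strategies look different in code, too. A hedged sketch: ask() is a hypothetical stand-in for whatever LLM client you use, and the prompts are illustrative:

```python
# Two measurement strategies for two sources of variance.
# ask() is a hypothetical stand-in for your LLM client: it takes a prompt
# string and returns the model's answer as a string.

def mention_rate(prompt, brand, ask, runs=20):
    """Strategy for random sampling: repeat one prompt many times and
    count how often a brand shows up. Averaging tames the dice."""
    hits = sum(brand.lower() in ask(prompt).lower() for _ in range(runs))
    return hits / runs

def anchoring_shift(brand, ask, runs=20):
    """Strategy for systematic bias: compare an unanchored prompt against
    one that names the brand. The gap is bias, not dice."""
    neutral = mention_rate("What are the best CRMs?", brand, ask, runs)
    anchored = mention_rate(f"What are the best CRMs like {brand}?", brand, ask, runs)
    return anchored - neutral
```

Repeated runs of one prompt average out the dice; paired prompts that differ only in the anchor isolate the bias. Lesson 3.1 builds on exactly this split.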

Try this

Open ChatGPT. Ask: "What are the three best AI brand visibility tools for marketers?" Copy the answer. Open a new chat in a clean session. Ask the exact same question. Compare the two responses.

You'll see surface volatility: different words, sometimes different tools listed. But now look at the meaning. Are the two answers roughly describing the same category of tool with the same attributes? Would a CMO reading both come away with the same impression?

If yes, you've just seen semantic stability underneath surface volatility. That gap between what changes and what doesn't is the whole measurement challenge of AEO in one experiment you can run in 90 seconds.
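
If you would rather script the experiment than click through two chat windows, here is a sketch using the OpenAI Python client. The model name is an assumption; substitute whichever one you use, and set OPENAI_API_KEY in your environment first:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
question = "What are the three best AI brand visibility tools for marketers?"

# Two independent calls: each one is a fresh roll of the weighted dice.
answers = [
    client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: swap in whatever model you use
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content
    for _ in range(2)
]

print("--- run 1 ---\n", answers[0])
print("--- run 2 ---\n", answers[1])
```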

Key takeaways

  1. AI answers change because of sampling, controlled by a parameter called temperature. This is random variation, not instability in the model's "opinion."
  2. Surface features (words, URLs, citations) are highly volatile. Semantic meaning is highly stable. 86 to 95 percent semantic agreement against 9 to 54 percent surface overlap.
  3. Measurement tools that track surface features are reporting noise. Tools that track semantic signals report real change. The distinction decides whether your AEO data is signal or static.

What's next

Module 2 is now complete. Before moving on, take the Module 2 Comprehension Check.

Module 3 opens with Lesson 3.1, Sycophancy, the first of four biases that distort AI brand recommendations. Sycophancy is the systematic cousin of what we just covered: when the model changes its answer not because of random sampling, but because your prompt told it what you wanted to hear.

If Module 2 gave you a working mental model of how the machine operates, Module 3 gives you the list of ways that machine is quietly lying to you. That's where AEO measurement gets serious.

Reflection prompt

Answer this in your own words: An AEO vendor shows you a chart of your "AI brand visibility score" fluctuating week over week. Based on what you learned in Lessons 2.1, 2.2, and 2.3, what are three questions you would ask about that chart before trusting it?

Write them down. If you can come up with three, you're ready for Module 3.


About this course

This lesson is part of AEO A to Z, the open course on Answer Engine Optimization published by GenPicked Academy. GenPicked Academy is where practitioners learn to measure AI recommendations with the same rigor a clinical trial demands: blind sampling, balanced question sets, and confidence intervals that hold up.

About the author: Dr. William L. Banks III is the lead researcher at GenPicked Academy and the architect of the three-layer AEO measurement architecture taught in this course. His work on sycophancy, popularity bias, and construct validity in AI search informs every lesson you just read.

See the methods in practice: GenPicked runs monthly brand-intelligence audits using the exact pipeline taught in Module 6. Read the case studies and audit walkthroughs on the GenPicked blog.

Knowledge check · ungraded

Check your understanding before moving on

1. The single biggest cause of answer-to-answer variance is:

  • Network latency
  • Stochastic sampling at generation time
  • Different IP addresses
  • A/B testing by the model vendor