Building AI-Powered Products
1. LLMs Are Next-Token Predictors — Everything Follows From That
2. Prompt Design
3. Context Engineering
4. API Integration: Retries, Backoff, and Graceful Fallbacks
5. Evals: Measure Before You Improve
6. Agent Architecture
7. Safety & Guardrails: Defense in Depth
8. Production AI — Latency, Cost, and Observability
Module 1

LLMs Are Next-Token Predictors — Everything Follows From That

Understand the token prediction loop, estimate token costs with the ¾-word rule, and apply model routing to cut inference spend.

You already know how an LLM works

Open your phone. Start typing a text message. See those word suggestions above the keyboard? Tap one. Now another. Keep tapping suggested words and you get a sentence — maybe a weird one, but a sentence.

You just did what an LLM does. It picks the next word (well, token — we'll get to that), adds it to the sentence, then picks the next one. Over and over. Thousands of times per response.

That's it. That's the whole trick. Everything else — the cost, the speed, the mistakes, the magic — flows from that one loop.

The prediction loop, step by step

Here's what happens every time you send a message to an LLM:

Let's walk through each step:

Step 1 — The Shredder (Tokeniser). Your message gets chopped into small pieces called tokens. Think of a paper shredder — it doesn't care about your words, it just cuts wherever its rules say to cut. "Hello, world!" becomes four pieces: `Hello`, `,`, ` world`, `!`.

Step 2 — The Number Translator. Each piece gets a number. "Hello" might be #9906. The computer only understands numbers, so every piece needs an ID — like how every student in school gets a student number.

Step 3 — The Meaning Map (Embedding). The model places each numbered piece on a giant map where similar meanings are close together. "Happy" and "joyful" are neighbours. "Happy" and "refrigerator" are far apart. This map has hundreds of dimensions — way more than the 2D maps you're used to.
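To make the map idea concrete, here's a toy sketch. The three-dimensional vectors below are invented for illustration (real embeddings have hundreds or thousands of dimensions, and these numbers are not from any real model), and cosine similarity is one standard way to measure how close two points on the map are:

```python
import math

# Invented 3-D "meaning map" coordinates -- purely illustrative.
EMBEDDINGS = {
    "happy":        [0.90, 0.80, 0.10],
    "joyful":       [0.85, 0.75, 0.15],
    "refrigerator": [0.10, 0.05, 0.90],
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Close to 1.0 = neighbours on the map; near 0 = unrelated meanings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

print(cosine_similarity(EMBEDDINGS["happy"], EMBEDDINGS["joyful"]))        # high
print(cosine_similarity(EMBEDDINGS["happy"], EMBEDDINGS["refrigerator"]))  # low
```

"Happy" and "joyful" score near 1.0; "happy" and "refrigerator" score much lower, which is exactly the "neighbours vs. far apart" intuition in vector form.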

Step 4 — The Brain (Transformer). This is where the magic lives. The transformer looks at all the pieces on the map and asks: "Given everything I've seen so far, what piece should come next?" It scores every possible next piece and picks one.

Step 5 — Loop. The chosen piece gets added to the input. The transformer runs again. And again. And again — until it decides it's done (by outputting a special "stop" token).
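The five steps above collapse into a short loop. This is a toy sketch: `predict_next` is a hard-coded lookup table standing in for the transformer, which in reality scores every candidate token (often ~100k of them) on every pass.

```python
# Invented bigram table standing in for a trained model's weights.
BIGRAMS = {
    "The": "capital",
    "capital": "of",
    "of": "France",
    "France": "is",
    "is": "Paris",
    "Paris": "<stop>",
}

def predict_next(tokens: list[str]) -> str:
    """Step 4: given everything seen so far, return the winning next token."""
    return BIGRAMS.get(tokens[-1], "<stop>")

def generate(prompt: list[str], max_tokens: int = 20) -> list[str]:
    tokens = list(prompt)
    for _ in range(max_tokens):      # step 5: loop until done
        nxt = predict_next(tokens)
        if nxt == "<stop>":          # the special "stop" token ends generation
            break
        tokens.append(nxt)           # the chosen token joins the input for the next pass
    return tokens

print(" ".join(generate(["The"])))   # → The capital of France is Paris
```

Note that each pass produces exactly one token: a six-token answer means six full trips through `predict_next`, which is why long responses cost proportionally more compute.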

There Are No Dumb Questions

"Wait — it loops? So a 500-word answer means the transformer runs 500+ times?"

Yep. Every single token requires a full pass through the transformer. A 500-token answer costs roughly 500x more compute than a 1-token answer. That's why long responses are expensive and slow.

"Does it plan ahead? Like, does it know how the sentence will end?"

Nope. Zero planning. It only ever picks the next token. It's like writing a story one word at a time without knowing where it's going. The fact that the output usually makes sense is what's remarkable — and why it sometimes doesn't.

⚡ Be the LLM

25 XP
You are an LLM. Your job: predict the next token. Here's your input so far:

**"The capital of France is"**

You score every possible next token. Here are your top 5 candidates with their probability scores:

| Token | Score |
|-------|-------|
| Paris | 0.92 |
| located | 0.03 |
| a | 0.02 |
| known | 0.02 |
| the | 0.01 |

1. Which token do you pick? Why?
2. After you pick it, what's the new input for your next prediction?
3. If temperature is set to 0 (always pick the highest score), will you always pick the same token? What if temperature is higher?

_Hint: At temperature=0 you always pick the top scorer. At higher temperatures, lower-scored tokens get a bigger chance — that's how you get creative (but sometimes weird) responses._

Tokens are the currency — learn to count them

Every API call costs money. The price tag? Tokens. Not words — tokens. So you need to know how tokens and words relate.

Here's the cheat code:

The ¾-Word Rule: 1 word ≈ 1.33 tokens. Or flip it: 1 token ≈ ¾ of a word.

To estimate: token count = word count × 1.33

Caveat: This rule holds for typical English prose. Code, non-English text, and technical terms often tokenize less efficiently — sometimes 2–3× more tokens per word. For accurate billing estimates, validate with your model provider's tokenizer before building a production cost model.
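The rule translates directly into a back-of-envelope helper. The function names are mine, and the rate argument is whatever your provider currently charges (check current pricing):

```python
def estimate_tokens(word_count: int, tokens_per_word: float = 1.33) -> int:
    """Apply the 3/4-word rule: tokens ≈ words × 1.33."""
    return round(word_count * tokens_per_word)

def estimate_input_cost(word_count: int, usd_per_million_tokens: float) -> float:
    """Rough input cost of one call. The rate is illustrative, not current pricing."""
    return estimate_tokens(word_count) / 1_000_000 * usd_per_million_tokens

# A 10,000-word legal brief at $3 per million input tokens:
print(estimate_tokens(10_000))                      # → 13300
print(round(estimate_input_cost(10_000, 3.0), 4))   # → 0.0399
```

For billing-grade numbers, swap the estimate for an exact count from your provider's tokenizer; this sketch is for quick sanity checks only.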

Let's see why. "Hello, world!" tokenises into four pieces: `Hello`, `,`, ` world`, `!`.

Two words became four tokens. The comma and the exclamation mark each count separately, and the space before a word gets bundled with the word (` world` is one token, not two).

Why this matters for your wallet: Verbose prompts with lots of punctuation and formatting burn more tokens than they appear to. Trimming filler from your prompts is free money.

⚡ Token Estimation Race

25 XP
Estimate the token count for each prompt below. Use the formula: **tokens = words × 1.33** (round to nearest whole number).

| Prompt | Word count | Your estimate |
|--------|-----------|---------------|
| "Hi, what time is it?" | 5 | ? |
| A 60-word paragraph of coding instructions | 60 | ? |
| A 2,000-word legal document | 2,000 | ? |

Now calculate cost for each at Claude Sonnet pricing ($3 per million input tokens as of early 2025 — verify at claude.com/pricing, as rates change with new model releases):

| Prompt | Tokens | Cost |
|--------|--------|------|
| A | ? | ? |
| B | ? | ? |
| C | ? | ? |

_Hint: Prompt A = 5 × 1.33 = 6.65, rounds to 7 tokens. Cost = 7 ÷ 1,000,000 × $3 = $0.000021. That's basically free. Now do B and C._

The context window: your model's short-term memory

Every model has a context window — the maximum number of tokens it can hold in memory at once. Think of it like a desk: you can only spread out so many papers before things start falling off.

| Model | Context window | Roughly how many words |
|-------|----------------|------------------------|
| Claude Haiku | 200k tokens | ~150,000 words |
| Claude Sonnet | 200k tokens | ~150,000 words |
| GPT-4o | 128k tokens | ~96,000 words |

(Context window figures and pricing as of early 2025; model specs and rates change frequently. Verify at anthropic.com for Claude models and platform.openai.com for OpenAI models. Model lineups evolve, and newer models may be available.)

Here's the trap: the context window includes both your input AND the model's output. So if you stuff 190k tokens of input into a 200k-token window, the model can only generate a 10k-token response before it hits the wall.

And here's the worse trap: blow the context window and your API call fails at runtime, not at design time. No compiler error. No warning. Your app just crashes in production when a user sends a long enough input.
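One defence is to check the budget yourself before the request ever leaves your process. This is a hypothetical guard, not a provider API; the window size and the source of your token counts are assumptions you'd set per model:

```python
CONTEXT_WINDOW = 200_000   # e.g. a 200k-token model; configure per model you call

def check_budget(input_tokens: int, max_output_tokens: int) -> None:
    """Fail fast before the API call if input + output can't fit the window."""
    total = input_tokens + max_output_tokens   # the window covers BOTH directions
    if total > CONTEXT_WINDOW:
        raise ValueError(
            f"Request needs {total:,} tokens but the window holds "
            f"{CONTEXT_WINDOW:,}: trim the input or lower max_output_tokens."
        )

check_budget(190_000, 10_000)   # exactly fits: no error
```

Pair this with an exact tokenizer count of your actual input rather than the word-count estimate when accuracy matters, and you turn a production crash into a controlled error you can handle.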

There Are No Dumb Questions

"If the context window is 200k tokens, can the model actually USE all 200k equally well?"

No! Research shows models are worst at finding information stuck in the middle of a long context. They're best at stuff near the beginning and the end. It's called the "lost in the middle" problem (Liu et al., 2023) — like how you remember the first and last items on a grocery list but forget the middle ones.


Model routing: stop paying luxury prices for basic tasks

Here's a real scenario. Priya, a backend engineer at a legal-tech startup, blew through her API budget in week one. She sent every document — no matter how simple — to Claude Sonnet. Monday morning: $39.90 in charges from a single overnight run. 1,000 legal briefs, all routed to Sonnet. At that rate, monthly spend would hit $1,197.

Then she ran the numbers:

| | Sonnet ($3/M tokens) | Haiku ($0.25/M tokens) | Savings |
|---|---|---|---|
| 1 brief (10k words ≈ 13,300 tokens) | $0.04 | ~$0.0033 | ~92% |
| 1,000 briefs/day | $40/day | ~$3.33/day | ~92% |
| Monthly (30 days) | $1,200 | ~$100 | ~92% |

The fix? Model routing — sending each task to the cheapest model that can handle it well.

Think of it like a hospital triage system: a nurse doesn't send every patient straight to a surgeon. Most cases go to the cheapest tier of care that can handle them well, and only the hard ones escalate.

(Pricing as of early 2025 — verify current rates at claude.com/pricing)

"Extract the party names from this contract" — that's a Haiku job. No reasoning needed. $0.003.

"Identify every clause that creates liability and rank them by risk" — that demands deeper reasoning. Sonnet earns its cost here.

How a prompt flows through an LLM: take the prompt "Summarise this contract in 3 bullet points. Payment terms: Net 30. Liability: 500K cap. Exit: 30 days notice." After tokenisation, it comes to roughly 25 tokens.

A practical starting point: for many pipelines, the majority of calls are simple extractions or classifications that smaller models handle well; reserve larger models for tasks that genuinely require complex reasoning. The right split depends on your specific task mix, so measure accuracy on each tier with a sample eval before committing to a routing strategy in production.
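In code, a router can be a cheap classification step in front of the API client. Everything below is illustrative: the per-call costs copy the worked brief example above, and the keyword heuristic is a placeholder for real task metadata or a cheap classifier call.

```python
# Illustrative per-call costs, taken from the worked brief example above.
COST_PER_CALL = {"haiku": 0.0033, "sonnet": 0.04, "opus": 0.20}

def route(task: str) -> str:
    """Toy triage: cheapest tier first, escalate only when the task demands it."""
    t = task.lower()
    if any(k in t for k in ("extract", "classify", "translate")):
        return "haiku"      # no reasoning needed
    return "sonnet"         # analysis and ranking earn the bigger model

def daily_cost(calls_per_tier: dict[str, int]) -> float:
    """Total daily spend given call counts per tier."""
    return sum(COST_PER_CALL[tier] * n for tier, n in calls_per_tier.items())

# Priya's mix (800 simple, 150 analysis, 50 deep reasoning), routed vs. all-Sonnet:
routed = daily_cost({"haiku": 800, "sonnet": 150, "opus": 50})
flat = daily_cost({"sonnet": 1000})
print(f"${routed:.2f} vs ${flat:.2f}")   # → $18.64 vs $40.00
```

The routing logic itself costs almost nothing, which is why even a crude heuristic like this usually pays for itself on day one; refine it with an eval once you have real traffic.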

⚡ Triage Challenge

50 XP
You're the triage nurse for an AI pipeline. Route each task to the right model tier and estimate the cost.

| Task | Haiku, Sonnet, or Opus? | Why? |
|------|------------------------|------|
| Classify customer support emails into 5 categories | ? | ? |
| Write a detailed technical architecture proposal | ? | ? |
| Extract dates and names from 500 invoices | ? | ? |
| Analyse a legal contract for hidden liability clauses | ? | ? |
| Translate "Hello" to Spanish | ? | ? |

**Bonus (50 XP):** Priya's pipeline processes 1,000 briefs/day. 800 are simple extractions, 150 need analysis, 50 need deep reasoning. Calculate the daily cost if she routes correctly vs. sending everything to Sonnet.

_Hint: Simple extraction → Haiku. Deep reasoning → Opus. Everything in between → Sonnet. For the bonus: (800 × $0.0033) + (150 × $0.04) + (50 × $0.20) vs. (1,000 × $0.04)._

Why LLMs make stuff up (and what to do about it)

Here's a conversation between Token (an LLM token) and User about hallucinations:


User: Why do you sometimes make up facts that sound totally real?

Token: Because I don't know facts. I predict what token comes next based on patterns I saw during training. If the most likely next token after "The capital of Australia is" is "Sydney" — because lots of text on the internet says that — I'll say Sydney. Even though it's wrong. (It's Canberra.)

User: That's terrifying. Can you at least tell me when you're not sure?

Token: Not really. My confidence score tells you how likely I think a token is compared to alternatives. It does NOT tell you whether the statement is true. I can be 99% confident about a completely false statement — because the pattern I learned was wrong, or the context is misleading.

User: So how do engineers deal with this?

Token: Three ways. RAG (Retrieval-Augmented Generation) — give me real documents to reference so I'm not relying only on my training data. Evals — test my answers against known-correct answers systematically. Human review gates — have a human check my work for high-stakes outputs.


🚨LLMs don't know what they don't know
An LLM that lacks information about something will often confidently invent a plausible-sounding answer rather than saying "I don't know." This is called hallucination. Design your systems to treat LLM output as a first draft that requires verification — never as ground truth for high-stakes decisions.

There Are No Dumb Questions

"If LLMs just predict tokens, how do they seem to 'reason'?"

Great question. When you see an LLM work through a problem step by step, it's not "thinking" the way you do. It's predicting that the next most likely tokens form a reasoning chain — because it was trained on millions of examples of humans reasoning step by step. The output looks like reasoning because the training data contained reasoning. Whether it truly "understands" is still debated, but for engineering purposes: treat it as a sophisticated pattern matcher, not a thinker.

Temperature: the creativity dial

When the model scores all possible next tokens, temperature controls how it picks from that list.

| Temperature | What happens | Good for |
|-------------|--------------|----------|
| 0 | Always picks the highest-scored token | Factual answers, code, extraction |
| 0.3–0.7 | Usually picks high-scored tokens but sometimes goes off-script | General conversation, analysis |
| 1.0+ | Lower-scored tokens get a real shot | Creative writing, brainstorming |

Think of it like a restaurant ordering system:

  • Temperature 0: You always order the #1 most popular dish. Predictable. Safe. Boring.
  • Temperature 0.5: You usually order a popular dish but sometimes try something new.
  • Temperature 1.0: You might order anything on the menu. Adventurous. Sometimes amazing. Sometimes... squid ink ice cream.
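A minimal sampler over the candidate scores from the "Be the LLM" exercise shows the dial in action. The mechanics mirror what real samplers do (divide log-scores by temperature, re-normalise, draw), though production implementations work on raw logits rather than probabilities, so treat this as a sketch:

```python
import math
import random

def sample_token(scores: dict[str, float], temperature: float, rng=None) -> str:
    """Pick the next token. T=0 is greedy; higher T flattens the distribution."""
    if temperature == 0:
        return max(scores, key=scores.get)   # always the top scorer
    rng = rng or random.Random()
    logits = {t: math.log(p) / temperature for t, p in scores.items()}
    m = max(logits.values())                 # subtract the max for numerical stability
    weights = {t: math.exp(v - m) for t, v in logits.items()}
    r = rng.random() * sum(weights.values())
    for tok, w in weights.items():           # weighted draw over the candidates
        r -= w
        if r <= 0:
            return tok
    return tok                               # floating-point edge-case fallback

CANDIDATES = {"Paris": 0.92, "located": 0.03, "a": 0.02, "known": 0.02, "the": 0.01}
print(sample_token(CANDIDATES, 0))   # → Paris (greedy, every time)
```

At temperature 0 you get `Paris` on every call; crank the temperature up and `located` or `known` start slipping through, which is exactly the "squid ink ice cream" effect from the restaurant analogy.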

There Are No Dumb Questions

"If I set temperature to 0, will I get the exact same response every time?"

Surprisingly, no — not always. Even at temperature=0, floating-point math on different hardware can produce tiny rounding differences that occasionally change which token gets picked. It's mostly deterministic, but don't build systems that require identical outputs from identical inputs.


How models get trained (the 60-second version)

You don't need to train models — but you need to know why they behave the way they do. Three stages:

Stage 1 — Pre-training: Reading the internet. The model reads billions of web pages, books, and code. It learns patterns: grammar, facts, writing styles, reasoning patterns. This is expensive (millions of dollars) and produces a model that can complete text but isn't helpful yet — like a kid who's read every book in the library but has zero social skills.

Stage 2 — Fine-tuning: Learning to be helpful. Humans write example conversations: "If a user asks X, a good response looks like Y." The model trains on thousands of these examples to learn how to be an assistant, not just a text completer.

Stage 3 — RLHF: Learning from feedback. Humans rate the model's responses: "This answer was great, this one was terrible." The model adjusts to produce more highly-rated outputs. This is the stage that teaches models to refuse harmful requests, stay on topic, and be genuinely useful. This post-training alignment stage — using human feedback to reward helpful, safe responses — is central to the behaviour you see in frontier models. (Modern models like GPT-4o and its successors use a combination of RLHF, supervised fine-tuning, and additional alignment techniques; RLHF is one key component. Anthropic trains Claude via Constitutional AI — a related approach that uses AI-generated feedback to supplement human feedback, reducing (but not eliminating) reliance on human raters.)

⚡ Match the behaviour to the stage

25 XP
Match each behaviour to the training stage that produced it: **Pre-training**, **Fine-tuning**, or **RLHF**.

1. The model knows that Paris is the capital of France
2. The model responds in a conversational Q&A format
3. The model refuses to help you build a weapon
4. The model can write Python code
5. The model says "I'm not sure" instead of making up an answer

Back to the phone keyboard

Those word suggestions above your keyboard are still doing the same thing an LLM does — picking the next token. The difference is scale: your phone's model is tiny and its predictions are generic, while GPT-4 has hundreds of billions of parameters and can write legal briefs. But the loop is identical. Every cost you'll pay, every hallucination you'll debug, and every clever trick you'll use to make AI work in production traces back to that one mechanism: predict the next token, add it, repeat.

Key takeaways

  • The whole game is next-token prediction. Every LLM feature, cost, and failure mode traces back to this one loop.
  • Tokens ≠ words. Use the ¾-word rule (words × 1.33) to estimate token counts and costs before you build.
  • Route aggressively. Send simple tasks to cheap models. Reserve expensive models for hard reasoning. That single change can significantly cut pipeline costs.
  • Temperature controls creativity vs. consistency. Low for facts, high for brainstorming.
  • LLMs don't know things — they predict things. That's why they hallucinate, and why you need RAG, evals, and human review.

Knowledge Check

1. A pipeline processes 1,000 legal briefs per day, each approximately 10,000 words. Using the ¾-word rule and Claude Sonnet pricing of $3 per million input tokens (as of early 2025), what is the estimated daily input cost?

2. You stuff a 90k-token legal document into a Claude Sonnet prompt and ask a question whose answer appears on page 40 of 80. Based on the "lost in the middle" research, what should you expect?

3. Why doesn't setting temperature=0 guarantee that two identical API calls return identical output?

4. Which post-training approach is most directly responsible for teaching frontier models (e.g., GPT-4o, Claude) to decline harmful requests and follow instructions helpfully?

Next: Prompt Design