Ai Ml

The Chat Baseline: What You're Starting With

Every AI product in the world starts as the same small, strange thing — a stateless function that turns tokens into tokens. Understand that substrate clearly and every capability above it stops looking like magic.

Tin Dang April 13, 2026 8 min read

Hand-drawn linear flow from user input through a context window stack into an LLM cloud, then token-by-token output, with a feedback loop returning output to input

Before you build anything interesting on top of a language model, it helps to know precisely what a language model is — and, just as importantly, what it isn’t. The gap between intuition and reality is bigger than most people realize, and almost every confusing thing about AI products later in this series traces back to a missed detail at this layer.

This post is the floor. The rung labeled LLM Core. By the end of it you’ll be able to look at any chatbot and sketch, in one diagram, exactly what happens between you pressing Enter and a reply appearing.

Hand-drawn diagram showing the chat baseline: tokens flow into an LLM function, tokens flow out word by word, with the entire conversation re-sent every turn — no real memory, just replay — The chat baseline: a stateless function that turns tokens into tokens, with the full conversation replayed every turn.

The model is a function

Here is the entire contract, written honestly:

tokens_in → model → next_token_probabilities → sampled_next_token

That’s it. A large language model takes a sequence of tokens (roughly: pieces of words), and returns a probability distribution over what token should come next. An outer loop then samples one token from that distribution, appends it to the input, and asks the model again. Repeat until the model emits a “stop” token or the budget runs out.

The model predicts the next token. Nothing more.

It does not reason in the way humans reason. It does not look anything up. It does not have a memory of previous conversations. It does not know what day it is, unless you tell it. It does not know whether its last answer was right, because it has no concept of “last answer” — by the time the next request arrives, it has already forgotten everything.

If this sounds reductive, it is also precise. Every impressive behavior you’ve seen from a modern model — reasoning, planning, apologizing for a mistake — is produced inside one continuous prediction pass. The model is not thinking; it is rapidly producing text that, statistically, is what thinking would look like if you wrote it down.

That is not a pejorative. It turns out to be incredibly useful. But the moment you forget it, you will over-trust the output.

The memory illusion

“But wait,” you say. “The chatbot I used yesterday clearly remembered what I asked earlier in the conversation.”

It did — but the model didn’t. Here is the trick, and it is the single most important thing to understand at this rung:

Every turn, the product re-sends the entire conversation so far.

When you type a new message, the system gathers:

The system message (hidden instructions — we’ll cover this in the next post)
Every previous user message and assistant reply in the current chat
Your new message

…concatenates them into one long string, and sends the whole thing to the model as a fresh request. The model reads it top-to-bottom, has no memory of having seen any of it before, and produces the next turn. Then the product appends the new reply to the record, and the next turn begins.

The “memory” you experience is a costume the product wears. The model underneath is as stateless as a calculator.

This has several immediate consequences that confuse people constantly:

Long conversations get slower and more expensive. You’re sending a bigger and bigger transcript each turn.
The model’s “personality” can drift as the system message gets diluted by user text accumulating above it.
Anything the model “forgets” in a long chat is often a product choice, not a model limit: it’s the harness deciding to trim old messages to stay within a size limit.

The context window

The long string the model reads has a size limit, measured in tokens. That limit is called the context window, and it is the most physical constraint in the entire stack.

As of 2026, context windows range from around 8K tokens on older free-tier models to one million or more on newer paid models. One million tokens is roughly a thousand pages of text — the size of a novel trilogy, give or take. Enormous, by historical standards. Still finite.

When a conversation approaches the limit, one of three things happens depending on the product:

Hard cut. The oldest messages fall off. The model stops being able to see them, and the product, from your perspective, “forgets” the start of the chat.
Summarization. The harness rewrites old messages as a short summary, reinserts that summary near the top, and drops the originals. Cheaper, but lossy.
Retrieval. Older messages are stashed in a searchable store, and relevant bits are fetched back into context when the model needs them. We’ll meet this idea in post 4.

The important point for now: there is no magic. The model only ever sees what fits in the context window, and the context window is finite. Every “memory” feature you’ll meet in later posts is some variation of pretending to have a bigger window than you actually have.

Token-by-token generation

One more detail. The output is not generated whole, in one pass. It is generated one token at a time, and each new token is immediately fed back as input so the next token can be conditioned on it.

This is why streaming UIs work. It is also why models can seem to “change their mind” mid-sentence — they cannot revise what they’ve already written. Once a token is sampled, it is committed.

There is a small but real consequence: models are not great at producing text that depends on facts they’ve already stated incorrectly. If token 12 was a wrong number, tokens 13 onward will build on that wrong number. The only way out is to generate again from scratch.

It also means two decisions with lasting effects are made inside the sampling loop:

Temperature — how “creative” the sampling is. Near zero, the model picks the most likely token almost every time and behaves predictably (good for code, classification). Higher up, it samples from less-likely tokens (good for brainstorming, poetry). Too high, it gets incoherent.
Stop tokens — signals that tell the loop to halt. Without good stop conditions, a model might happily keep going forever, especially on open-ended prompts.

Neither of these is the model itself. They are knobs on the harness. But they profoundly shape how the model appears to behave.

What this substrate can and cannot do

Given all of the above, here is an honest capability sheet for a pure chat baseline:

Can do:

Answer one-turn questions about anything in the model’s training data.
Carry on a multi-turn conversation, with memory degrading gracefully as the context fills.
Follow instructions embedded in the prompt (“be concise”, “answer in French”).
Produce structured text when asked (JSON, code, tables) — though without checks on whether it’s correct.

Cannot do:

Look up anything outside the prompt it has been given.
Take any action in the outside world.
Remember anything across separate chats or sessions.
Know whether its answer is correct — or admit, reliably, when it doesn’t know.
Catch its own mistakes after they happen.

Almost every feature added in the rest of this series is a response to one of those limits. RAG fixes “can’t look anything up.” Tool use fixes “can’t take any action.” Memory fixes “can’t remember across sessions.” Agent loops fix “can’t catch its own mistakes.”

The floor is real, and it is small. That is why the tower matters.

The honest diagnostic

If someone tells you a product is “just a better chatbot,” you can test the claim with four questions:

Can it do anything other than produce text? If no, you’re at this rung.
Does it know about things that happened after its training cutoff? If no, no retrieval rung.
Does it remember what you told it in a conversation last week? If no, no memory rung.
If it gives a wrong answer, does it notice and retry? If no, no agent loop.

A clean “no” to all four is not a failure — it just means the product lives at the chat baseline, and you should evaluate it on baseline terms: does it produce text that is genuinely useful to you in a single turn?

Most products that feel magical have climbed at least to rung 3 or 4. Almost none of them advertise clearly which rungs they’ve built. This series is, in part, a way to ask them better questions.

Where we go next

The cheapest, highest-leverage thing you can do to a chat baseline without climbing further is to write a serious system prompt. That’s the next post, and it’s where most products get far more value than people credit — or fail to, and then blame the model.

Next in this series

System Prompts & Personas: The Cheapest Control Surface

The model is a function

The memory illusion

The context window

Token-by-token generation

What this substrate can and cannot do

The honest diagnostic

Where we go next

Related Posts

Skills — Giving an Agent a Playbook

Multi-Agent Systems & Production Platforms

Memory — How Agents Build Continuity