
RAG — Giving the Model a Library

Your company's docs, this morning's tickets, yesterday's deploy notes — none of it is in the model. Retrieval-augmented generation hands it the right page, right before it answers. Here's the two-phase pipeline, honest failure modes, and what 'embeddings' actually are.

Tin Dang

Your language model has read a large portion of the public internet, a vast number of books, and an enormous amount of code. It has not read your company’s wiki. It has not read this morning’s support tickets. It has not read the deploy note your teammate wrote at 2 AM explaining why production is on fire.

And it cannot read any of those things without help. Its knowledge is frozen at training time; it has no eyes, no access, no way to open a document you point at. Unless you build the rung we’re climbing in this post.

RAG — retrieval-augmented generation — is the rung where AI products stop making things up about your data and start actually using it. The name is terrible. The idea is simple. It’s one of the few parts of the stack that, once you see the diagram, you cannot unsee.

[Figure: hand-drawn two-phase RAG pipeline. Phase 1 builds the library offline (docs, chunk, embed, store); Phase 2 answers questions per turn (embed query, search, stitch, LLM, answer with citations).]
RAG in two phases: build the library once, search it every turn. 80% retrieval engineering, 20% prompting.

The one-sentence version

RAG is library lookup: right before the model answers, hand it the most relevant passages from your documents, and ask it to use them.

That’s it. That’s the whole thing.

Everything else in this post is engineering detail about how to build the library, how to find the right passages, and where the whole thing goes wrong in practice.

Two phases, one pipeline

RAG is a pipeline with two phases. The first phase runs rarely (when documents change). The second phase runs every time a user asks a question.

Phase 1 — Build the library (runs offline).

docs → chunk → embed → store

You take your documents — wiki pages, PDFs, tickets, whatever. You split them into chunks small enough to fit comfortably inside a prompt. You convert each chunk into a vector (a list of numbers) using an embedding model. You store the vectors alongside the original text in a database designed for fast similarity search.
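
Phase 1 can be sketched in a few lines. This is a minimal illustration, not a production indexer: `toy_embed` is a hypothetical stand-in for a real embedding API (it hashes words into buckets, so it captures word overlap rather than meaning), the chunker splits on whitespace words rather than model tokens, and a plain Python list stands in for the vector database.

```python
import hashlib
import math

DIM = 64  # toy dimensionality; real embedding models use hundreds or thousands


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding API: hashed bag-of-words, unit length."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def chunk(text: str, size: int = 40) -> list[str]:
    """Split text into chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(docs: dict[str, str]) -> list[dict]:
    """docs -> chunk -> embed -> store. Each record keeps the text next to its vector."""
    store = []
    for source, text in docs.items():
        for n, piece in enumerate(chunk(text)):
            store.append({
                "source": source,
                "chunk_id": n,
                "text": piece,
                "vector": toy_embed(piece),
            })
    return store
```

The detail worth copying is the record shape: the original text and its source travel with the vector, so whatever the search returns can be quoted and cited later.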

Phase 2 — Answer a question (runs per user turn).

question → embed → search → stitch → LLM → answer with sources

When a user asks something, you embed their question the same way you embedded the chunks. You search the store for the top few chunks whose vectors are closest to the question’s vector. You stitch those chunks into a prompt that says, roughly, “Here are some relevant passages. Using them, answer the user’s question.” The model produces the answer, ideally citing the chunks it used.

The whole pipeline has maybe seven moving parts. Each is boring on its own. The magic is the composition.
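
To make "the composition" concrete, here is a sketch of Phase 2 as plain function composition. Every stage is passed in as a callable, and all four parameter names (`embed`, `search`, `stitch`, `llm`) are assumptions made for illustration; a real system would plug in an embedding client, a vector store query, a prompt template, and a model call.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def answer(
    question: str,
    embed: Callable[[str], Vector],
    search: Callable[[Vector, int], list[str]],
    stitch: Callable[[str, list[str]], str],
    llm: Callable[[str], str],
    k: int = 5,
) -> str:
    """question -> embed -> search -> stitch -> LLM -> answer."""
    query_vector = embed(question)        # same embedding as the chunks
    passages = search(query_vector, k)    # top-k closest chunks
    prompt = stitch(question, passages)   # passages plus instructions
    return llm(prompt)                    # the only step that calls the model
```

Because each stage is swappable, any of them can be replaced with a stub and tested in isolation, which is useful when you are trying to work out which stage is responsible for a bad answer.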

What an embedding actually is

The most jargon-heavy word in this whole stack is “embedding.” It sounds mathematical and forbidding. It is, in plain English, this:

An embedding is a way of turning a piece of text into a list of numbers, such that texts with similar meanings get similar lists of numbers.

“I love dogs” and “I adore puppies” might become two nearly identical vectors. “I love dogs” and “the GDP of France” will be far apart. The embedding model is trained to produce these vectors; you don’t write the math, you call an API.

The vector might have 768 numbers, or 1,536, or a few thousand. It doesn’t matter for your intuition. What matters is that two texts with similar vectors are, almost always, about similar things. And “close in vector space” is something a database can search for very quickly.
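
One common way to measure "similar vectors" is cosine similarity, the cosine of the angle between them. The sketch below uses three hand-written 3-number vectors, invented purely for illustration; they are not the output of any real embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 for similar, near 0.0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings.
i_love_dogs = [0.9, 0.1, 0.0]
i_adore_puppies = [0.8, 0.2, 0.1]
gdp_of_france = [0.0, 0.1, 0.9]

close = cosine_similarity(i_love_dogs, i_adore_puppies)
far = cosine_similarity(i_love_dogs, gdp_of_france)
print(close > far)  # the two dog sentences score much higher than dog vs. GDP
```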

Once you internalize this, the rest of RAG becomes mechanical. You’re just building a search engine where the query language is “things that mean the same as this.”

Chunking: the quiet killer

The quietest way for RAG to fail is bad chunking.

If your chunks are too big, each one covers multiple topics and the retrieval is imprecise. If your chunks are too small, they lose context — a sentence ripped from its paragraph is often meaningless. If they’re split in the middle of a sentence, or across a page break in a PDF, they become nonsense.

A decent default is chunks of 500–1,000 tokens with a small overlap (say, 50 tokens) between neighbors so that ideas that cross a boundary still have a chance of ending up intact in one chunk. But “decent” is doing heavy lifting there. Good chunking is domain-specific:

  • Code should chunk by function, not by line count.
  • Legal documents should chunk by clause, not by paragraph.
  • Conversational data (tickets, chat logs) should chunk by message or by thread, not by token count.
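
The fixed-size default with overlap described above can be sketched as a sliding window. This is an illustration under a simplifying assumption: the input is already a list of tokens, and splitting on whitespace is only a rough stand-in for a real tokenizer. The function name and parameters are made up for this example.

```python
def chunk_with_overlap(tokens: list[str], size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping by `size - overlap` each time."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


tokens = "some long document text goes here".split()  # crude whitespace "tokens"
for piece in chunk_with_overlap(tokens, size=4, overlap=1):
    print(piece)
```

Each chunk starts with the last `overlap` tokens of the previous one, which is what gives an idea that straddles a boundary a chance to appear whole somewhere. The domain-specific strategies in the list above would replace the fixed window with a split on function, clause, or message boundaries.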

If your RAG system is giving vaguely-right-but-not-quite-right answers, start here before you blame the model. More products than you’d think are one chunking fix away from working.

Retrieval: top-K and its discontents

When the user asks a question, you embed it and search for the closest vectors in the store. The top K — typically 3 to 10 — are passed to the model. Choosing K is a small decision with real consequences.

Too low (K=1), and one bad match wrecks the answer. Too high (K=20), and you burn context window on irrelevant passages that dilute the model’s attention. Five is a good starting point. Measure.
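
A brute-force version of top-K search fits in a dozen lines. This sketch assumes every vector is already unit length, so the dot product equals cosine similarity, and it scans every stored chunk; dedicated vector stores typically use approximate nearest-neighbor indexes to avoid that full scan. The `(text, vector)` store layout is an assumption for this example.

```python
import heapq


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vector: list[float], store: list[tuple[str, list[float]]], k: int = 5):
    """Return the k highest-scoring (score, text) pairs, best first."""
    scored = ((dot(query_vector, vector), text) for text, vector in store)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


store = [
    ("chunk about deploys", [1.0, 0.0]),
    ("chunk about lunch", [0.0, 1.0]),
    ("chunk about rollbacks", [0.6, 0.8]),
]
print(top_k([1.0, 0.0], store, k=2))
```

Returning the scores alongside the text makes "Measure" practical: you can log how far apart the Kth and (K+1)th results are and see whether a different K would have changed what the model saw.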

Two refinements are worth knowing:

Hybrid search. Pure vector search is great at semantic matches (“dogs” matches “puppies”) but weak at exact matches: a query for “SKU-12345” should return the chunk that contains “SKU-12345”, and embeddings often miss it. Real systems often combine vector search with plain keyword search (the boring old kind) and merge the results. The gains are substantial, especially for queries that contain identifiers, codes, or names.
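
The post does not say how to merge the two result lists. One widely used option is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the method any particular system uses. It ignores raw scores, which are not comparable between vector and keyword search, and looks only at each chunk's rank in each list. The chunk ids are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk ids; each appearance adds 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


vector_hits = ["c1", "c2", "c3"]    # hypothetical ids from vector search
keyword_hits = ["c4", "c2", "c1"]   # hypothetical ids from keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Chunks that appear in both lists rise to the top, while a chunk found by only one method still makes the merged list. The constant `k = 60` is a conventional default for RRF, not a tuned value.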

Re-ranking. Pull a larger initial set (say, 20 candidates), then use a second, more expensive model to score each candidate’s relevance to the question, and keep the top 3–5. Slower, much more precise. Worth it when answer quality matters.
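
The two-stage shape of re-ranking looks roughly like this. The `score` callable is where the second, more expensive model would go; `word_overlap` is a deliberately crude, hypothetical stand-in so the example runs without any model at all.

```python
from typing import Callable


def rerank(
    question: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    keep: int = 5,
) -> list[str]:
    """Score every candidate against the question and keep the best `keep`."""
    scored = [(score(question, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:keep]]


def word_overlap(question: str, passage: str) -> float:
    """Toy scorer: fraction of question words that also appear in the passage."""
    q_words = set(question.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


candidates = [
    "deploy failed at night",
    "how to rollback a deploy",
    "lunch menu for friday",
]
print(rerank("how to rollback", candidates, word_overlap, keep=1))
```

The cost trade-off is visible in the signature: `score` runs once per candidate, so pulling 20 candidates means 20 scoring calls per question before the answering model is invoked.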

Neither of these is mysterious. They’re engineering knobs. The answering model is not involved in any of this: retrieval is a classical search problem, dressed in new clothes.

The prompt, stitched

Once you have your top chunks, you assemble a prompt.

A reasonable template looks like this:

You have access to the following relevant passages from the user's documents:
<passage id="1" source="handbook.md">
{chunk 1 text}
</passage>
<passage id="2" source="deploy-runbook.md">
{chunk 2 text}
</passage>
...
Use these passages to answer the user's question. If the passages do
not contain the answer, say so rather than guessing. Cite passages by id.
Question: {user's question}

This is where the magic happens — not because the template is special, but because the model now has the answer in its context window and is being told to use it. A strong RAG system will quote from the passages almost verbatim on factual questions, which is exactly what you want. A weak one will still hallucinate despite having the passages right there, which is almost always a sign that the model isn’t being given enough explicit instruction to prefer the retrieved text over its own knowledge.
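
Filling in the template above is ordinary string assembly. The sketch below assumes each retrieved passage arrives as a dict with `source` and `text` keys (a layout chosen for this example) and does no escaping, so a source name or chunk containing quote characters or passage tags would need extra handling.

```python
def stitch_prompt(question: str, passages: list[dict]) -> str:
    """Build the RAG prompt: numbered passages, instructions, then the question."""
    header = "You have access to the following relevant passages from the user's documents:"
    blocks = [
        f'<passage id="{i}" source="{p["source"]}">\n{p["text"]}\n</passage>'
        for i, p in enumerate(passages, start=1)
    ]
    instructions = (
        "Use these passages to answer the user's question. If the passages do\n"
        "not contain the answer, say so rather than guessing. Cite passages by id."
    )
    return "\n".join([header, *blocks, instructions, f"Question: {question}"])


prompt = stitch_prompt(
    "Why is production on fire?",
    [
        {"source": "handbook.md", "text": "On-call engineers own incident response."},
        {"source": "deploy-runbook.md", "text": "Roll back before debugging."},
    ],
)
print(prompt)
```

Numbering the passages in the prompt is what makes "cite passages by id" checkable: the ids in the model's answer can be mapped back to the source files shown to the user.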

Failure modes, honest

RAG is powerful and regularly disappoints. The most common failure modes, roughly in order of frequency:

1. Wrong chunks retrieved. The user’s question looks semantically close to a chunk about a different topic. The model answers confidently using the wrong source. Fix: better chunking, hybrid search, re-ranking.

2. Right chunks retrieved, ignored by the model. The passages contain the answer, but the model leans on its prior training instead. Fix: sharper prompt (“if the passages contradict your prior knowledge, trust the passages”), or use a stronger model.

3. Stale index. Documents have changed since the embeddings were built. Fix: rebuild on a schedule, or on document change.

4. Ambiguous questions. The user asks something so vague that no chunk is clearly best. Fix: make the product ask a clarifying question first, or return the top-K as suggestions rather than an answer.

5. Contradictory sources. The documents themselves disagree (version A says one thing, version B says another). The model averages them and confuses everyone. Fix: curate the source documents; RAG is not a substitute for a single source of truth.

Notice that four of the five failure modes sit upstream of the model, in retrieval, indexing, or the source documents themselves. Good RAG is 80% retrieval engineering, 20% prompting.
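
For the stale-index failure (item 3 above), "rebuild on document change" needs a way to notice the change. One simple approach, sketched here as an assumption rather than a recommendation from any particular vector store, is to keep a content hash per source at index time and compare it against the current documents. The function names and the dict layouts are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's current text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_sources(docs: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Sources that are new or whose text changed since they were embedded."""
    return sorted(
        source
        for source, text in docs.items()
        if indexed_hashes.get(source) != content_hash(text)
    )


def removed_sources(docs: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Sources still in the index whose documents no longer exist."""
    return sorted(set(indexed_hashes) - set(docs))


indexed = {
    "runbook.md": content_hash("restart the service"),
    "handbook.md": content_hash("be kind"),
    "old-notes.md": content_hash("obsolete"),
}
current = {
    "runbook.md": "roll back, then restart the service",  # edited
    "handbook.md": "be kind",                              # unchanged
    "faq.md": "new page",                                  # added
}
print(stale_sources(current, indexed))    # sources to re-chunk and re-embed
print(removed_sources(current, indexed))  # sources whose chunks should be deleted
```

Only the changed and new sources need re-embedding, and deleted documents have to be purged explicitly; otherwise their chunks keep being retrieved long after the page is gone.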

RAG vs fine-tuning

Every time RAG comes up, someone asks: “Why not just fine-tune the model on our data?” Here’s the honest comparison.

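|                                  | RAG                       | Fine-tuning                     |
|----------------------------------|---------------------------|---------------------------------|
| Adds new knowledge               | Yes, instantly            | Yes, but slowly and imprecisely |
| Updates when docs change         | Just re-index             | Re-train                        |
| Shows sources to the user        | Naturally                 | Almost never                    |
| Hallucination risk               | Low, if retrieval is good | Still moderate                  |
| Cost per query                   | Higher (more tokens)      | Lower                           |
| Setup cost                       | Medium                    | High                            |
| Good for teaching style / format | Weak                      | Strong                          |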

RAG is almost always the right first move for “the model doesn’t know this specific thing.” Fine-tuning is the right move for “the model consistently gets the shape of the answer wrong” (wrong voice, wrong format, wrong reasoning pattern). They’re solving different problems and can be used together.

What RAG still cannot do

RAG fixes one specific limit of the chat baseline: the model’s ignorance of external data. It does nothing about the other three limits.

It still cannot take action in the world. It still cannot remember across sessions without more work. It still answers one turn at a time and gives up when the first attempt fails.

Those are the next three rungs. The next one — tool use — is the big one. It is the rung where AI products stop being information retrievers and start being actors.

Read next: Tool Use — Giving the Model Hands.
