Understanding RAG from Scratch: How the LLM Era's "Open-Book Exam" Works
If you’re just getting started with LLM development, you’ve probably been hit by a wave of jargon: RAG, chunks, vector databases, embeddings… Don’t panic. These ideas are far simpler than they sound.
Imagine you’re a great student, but suddenly you’re forced to take a closed‑book exam on a book you’ve never read. You’d have to make things up – that’s exactly the hallucination problem in large language models.
What’s the fix? Let you take an open‑book exam. You don’t need to memorise the whole book – you just need to know how to quickly find the right paragraphs and use them to answer. That’s RAG in a nutshell.
This post explains RAG, its key building blocks (chunks, vector databases), and the pros & cons – all in simple terms.
1. What is RAG? – Teaching an LLM to “look things up”
RAG stands for Retrieval‑Augmented Generation. Let’s break it down:
- Retrieval – Find relevant pieces of information from a knowledge base or document.
- Augmented – Feed those pieces to the LLM as reference material.
- Generation – The LLM produces an answer based on those references.
Example:
You ask: “What is the process for requesting annual leave at our company?”
A plain LLM would guess (and likely be wrong).
RAG first looks up the relevant sections in your employee handbook, then gives those sections together with your question to the LLM, so it can answer based on the actual policy.
In one sentence: RAG = look it up first, then answer.
It’s like giving the LLM glasses and letting it read the book, rather than forcing it to recite from memory.
2. Chunks – breaking a book into sticky notes
The first step in RAG is retrieval. But computers don’t “flip through” books like we do. We need an efficient way to search. That’s where chunks come in.
Why do we need chunks?
Imagine you have a 500‑page PDF manual. You can’t just throw the whole book into the LLM – there’s a limit on input length (context window), and most of the content is irrelevant, which actually confuses the model.
So you split the document into small, self‑contained pieces. Each piece is a chunk. For example:
- Split by natural paragraphs
- Split by a fixed number of characters (e.g., 200 characters per chunk)
- Split at semantic boundaries (e.g., whenever you see a heading like ###)
Each chunk is like a sticky note that carries one complete piece of knowledge.
When you ask a question, the system searches among these sticky notes – not the whole book.
How large should a chunk be?
- Too small (e.g., 50 chars) – lacks context, may be incomplete.
- Too large (e.g., 1000 chars) – brings in irrelevant noise and wastes the LLM’s attention.
In practice, chunk sizes of 200–500 tokens (~150–400 Chinese characters, or ~150–300 English words) are common. The best size depends on your documents and use case.
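The fixed-size strategy (with overlap, which we'll meet again later) can be sketched in a few lines of Python. The sizes below are illustrative, not recommendations:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 500-character document with 200-char chunks and 50-char overlap
# yields windows starting at positions 0, 150, 300, 450.
chunks = chunk_text("a" * 500, chunk_size=200, overlap=50)
```

Real pipelines usually split at paragraph or sentence boundaries first and only fall back to hard character cuts, but the sliding-window idea is the same.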
3. Vector Databases – semantic search without keywords
Now we have chunks. The next question: when a user asks “how do I request annual leave”, how do we quickly find the most relevant chunks?
The traditional way is keyword search (like Elasticsearch), but it’s rigid. If the user says “leave request” and the document says “vacation request”, keyword matching fails.
That’s where vector databases come in.
First, what’s a vector?
Computers don’t understand words, but they understand numbers.
We can use an embedding model (e.g., BGE, OpenAI’s text-embedding-3) to turn a piece of text into a fixed‑length array of floating‑point numbers, like:
[0.12, -0.34, 0.56, …, 0.78]. This array is a vector.
The magic: sentences that are semantically similar end up with vectors that are close together in space.
- “Annual leave request procedure” and “How to apply for vacation days” will have very close vectors.
- “Annual leave request procedure” and “Today’s weather is nice” will be far apart.
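"Close together in space" is usually measured with cosine similarity. Here is a toy demonstration with hand-made 3-dimensional vectors (real embedding models output hundreds or thousands of dimensions; the numbers below are invented for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (made up for this example).
leave_policy = [0.9, 0.1, 0.0]   # "Annual leave request procedure"
vacation_faq = [0.8, 0.2, 0.1]   # "How to apply for vacation days"
weather      = [0.0, 0.1, 0.9]   # "Today's weather is nice"

assert cosine_similarity(leave_policy, vacation_faq) > cosine_similarity(leave_policy, weather)
```

The two leave-related vectors point in nearly the same direction, so their similarity is close to 1; the weather vector points elsewhere, so its similarity is close to 0.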
What does a vector database do?
It’s a database specialised for storing and searching vectors. It does two things:
- Store – Save the vector of each chunk.
- Search – Given a user query (also turned into a vector), quickly find the top‑K most similar chunk vectors.
This is called vector similarity search. Common distance metrics are cosine similarity, Euclidean distance, etc. A database with one million vectors can be searched in milliseconds.
Popular vector databases: Chroma (lightweight, great for learning), Pinecone (cloud), Qdrant (open‑source, high performance), Milvus (enterprise).
Traditional databases like PostgreSQL (via the pgvector extension) also support vectors now.
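Conceptually, top-K search is just "score the query against every stored vector and keep the best matches". Real databases use approximate indexes (e.g., HNSW) to avoid this full scan, but the brute-force version below shows what they compute. The chunks and 2-dimensional vectors are invented for illustration:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """store: list of (chunk_text, vector). Returns the k most similar chunk texts."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored.sort(reverse=True)  # highest similarity first
    return [text for _, text in scored[:k]]

store = [
    ("Annual leave may be carried over up to 5 days.", [0.9, 0.1]),
    ("The cafeteria opens at 11:30.",                  [0.1, 0.9]),
    ("Leave requests are approved by your manager.",   [0.8, 0.3]),
]
results = top_k([1.0, 0.2], store, k=2)  # both leave-related chunks win
```

Swapping this linear scan for an indexed database (Chroma, Qdrant, etc.) changes the speed, not the idea.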
4. The complete RAG flow (at a glance)
Offline (pre‑processing) stage
- Split documents (PDF, Word, web pages) into chunks.
- Convert each chunk into a vector using an embedding model.
- Store all vectors in a vector database.
Online (query) stage
- User asks a question.
- Convert the question into a vector.
- Search the vector database for the top‑K most similar chunks.
- Combine those chunks with the original question into a prompt, then send it to the LLM.
- The LLM generates an answer and returns it.
Example:
User asks: “Can we carry over unused annual leave to next year?”
- Retrieved chunk: “Unused annual leave may be carried over up to 5 days, and must be used by the end of Q1 next year.”
- Prompt built:
“Answer based only on the following material:\n[Material] Unused annual leave may be carried over up to 5 days, and must be used by the end of Q1 next year.\n\nQuestion: Can we carry over unused annual leave to next year?”
- LLM answers: “Yes, up to 5 days, and you must use them by the end of Q1 next year.”
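Building that prompt is plain string assembly. The template wording below is one reasonable choice, not a fixed standard:

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into a single LLM prompt."""
    material = "\n".join(f"[Material {i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer based only on the following material. "
        "If the material does not contain the answer, say you don't know.\n\n"
        f"{material}\n\nQuestion: {question}"
    )

prompt = build_rag_prompt(
    "Can we carry over unused annual leave to next year?",
    ["Unused annual leave may be carried over up to 5 days, "
     "and must be used by the end of Q1 next year."],
)
```

The "say you don't know" escape hatch matters: without it, a model given irrelevant material tends to answer from memory anyway.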
5. Advantages of RAG – why it’s the first choice for LLM applications
1. Less hallucination, more traceability
The LLM no longer “guesses from memory”. Every answer is grounded in retrieved chunks. You can even show the source text to make answers verifiable.
2. Knowledge updates instantly – no retraining
Traditional LLMs are frozen at their training cut‑off date. Updating knowledge costs millions.
With RAG, you simply add, remove, or change documents in your knowledge base – the next query will see the change.
3. Low cost and private data ready
You don’t need to fine‑tune a model or build your own LLM. Just a general‑purpose LLM (GPT‑4o, Claude, Llama 3, Qwen, etc.) + your own vector database.
Your data stays inside your infrastructure – no need to share it with model vendors.
4. Sources can be cited, increasing trust
RAG can tell you “the answer came from chapter 3, section 2”. That’s critical for customer support, legal, medical, and many other domains.
6. Disadvantages / challenges – not a silver bullet
1. Retrieval quality is everything
If the vector database fails to find the right chunk, or if the chunk contains wrong information, even the best LLM will fail. Garbage in, garbage out applies strongly to RAG.
2. Chunking requires careful tuning
Chunks that are too small lose context; chunks that are too large bring in noise. Sometimes you need overlapping chunks to avoid cutting a critical sentence in half.
Chunk tuning is often more art than science – you’ll need to experiment.
3. The LLM may still ignore retrieved content
Some models, if overconfident, may disregard the material you gave them and fall back on their own memory. A related failure mode is “lost in the middle”: relevant material buried in the middle of a long prompt gets less attention than material at the start or end.
Solutions: use a stronger model, or craft a stricter prompt (“answer strictly based on the following material – do not make things up”).
4. Retrieval + generation adds latency
A RAG request adds at least one vector search (milliseconds) and an embedding call (tens of milliseconds). For real‑time chat, total latency can reach a few hundred milliseconds to a couple of seconds. Caching and optimisation are often needed.
5. Limited complex reasoning
RAG is great at “look up and answer”, but not great at synthesising multiple documents for complex reasoning. Example: “Combine the price from document A and the discount rule from document B to calculate the final price.” RAG only supplies the data; the model itself must do the reasoning.
7. A simple analogy to remember everything
- Plain LLM – A student with great memory but who sometimes makes things up, taking a closed‑book exam.
- RAG – The same student, now allowed to bring reference books (your knowledge base).
- Chunks = tearing the book into small slips for quick lookup.
- Vector database = a smart indexing system. You ask “how to make braised pork belly”, and it instantly finds the slip that says “pork belly, rock sugar, star anise, soy sauce”.
- Generation = the student reads the slip and puts the answer into their own words.
8. Learning path for beginners
If you want to break into LLM development, RAG is the highest‑value entry point – you don’t need massive compute or the ability to train models.
- Run a minimal demo
  Use LangChain + Chroma + Ollama (running Llama 3 or Qwen locally). You can build a working RAG system in ~10 lines of code. Search keywords: LangChain RAG tutorial.
- Understand embedding models
  Try different embedding models (BAAI/bge-small-en, text-embedding-3-small). See how the similarity between two sentences changes with different models.
- Experiment with chunking strategies
  Take a PDF manual (e.g., from your work). Try different chunk_size and overlap values, and observe the retrieval quality.
- Go further: hybrid search + reranking
  Real-world systems often combine keyword search and vector search, then use a reranking model to push the most relevant chunks to the top. This is the standard industrial approach.
- Learn to evaluate RAG systems
  Use frameworks like RAGAS or ARES to automatically measure your system on metrics like faithfulness, answer relevance, and context recall.
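To build intuition for what these metrics measure, here is a hand-rolled toy version of context recall (not the RAGAS API): did retrieval actually surface the chunk containing the gold fact? Real frameworks use an LLM judge rather than exact string matching:

```python
def toy_context_recall(gold_facts: list[str], retrieved_chunks: list[str]) -> float:
    """Fraction of gold facts that appear verbatim in at least one retrieved chunk.
    A crude stand-in for the context-recall metric frameworks compute more robustly."""
    if not gold_facts:
        return 1.0
    hits = sum(any(fact in chunk for chunk in retrieved_chunks) for fact in gold_facts)
    return hits / len(gold_facts)

score = toy_context_recall(
    ["carried over up to 5 days"],
    ["Unused annual leave may be carried over up to 5 days."],
)
```

Even this crude check is useful during development: if recall is low, no amount of prompt engineering downstream will save the answer.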
Closing thoughts
RAG is not the final form of LLMs, but it is the most practical way to bring LLMs into real‑world business applications today. It admits that LLMs hallucinate, so it gives them a reference book. It admits that training models is expensive, so it replaces memorisation with retrieval.
For a complete beginner, the barrier to RAG is surprisingly low – understand chunks and vectors, run a demo, and you’re already ahead of most people who only read about it. And when you start tuning chunk sizes, comparing embedding models, and designing hybrid retrieval pipelines, you’ll be right in the core of LLM application development.
I hope this post turns those intimidating terms into friendly tools in your toolbox.