RAG stands for retrieval-augmented generation. The concept sounds complicated but the core is three steps: break your documents into chunks, turn those chunks into vectors, and at query time find the most relevant chunks and pass them to the LLM as context.
That is the whole thing. Everything else is implementation details.
This guide covers a complete local RAG pipeline with no cloud dependencies and no data leaving your machine.
—
Why RAG beats stuffing documents into context
The naive approach to giving an LLM access to your documents is to paste them all into the context window. This works until it does not.
Context windows have limits. Even large-context models tend to lose quality on long inputs. Evaluations of long-context models have repeatedly found that information in the middle of a very long context gets underweighted compared to the beginning and end, an effect often called "lost in the middle". If your answer is in document 7 of 20, the model may miss it.
RAG retrieves precisely. Instead of feeding everything and hoping, you retrieve only the 3-5 most relevant chunks for the specific query. The model gets a focused context and produces a more reliable answer.
RAG scales. Once your documents are embedded, you can query across thousands of pages without increasing per-query token cost. The context window stays small and consistent regardless of total document volume.
The tradeoff: RAG requires upfront indexing work and introduces retrieval errors. If the retriever picks the wrong chunks, the LLM answers confidently from irrelevant content. A good chunking strategy is the main defense against this.
—
Chunking strategy: the decisions that matter most
Chunking is where most RAG implementations go wrong. Model choice matters less than how you split your documents.
Fixed-size chunking splits text every N characters or tokens with some overlap. It is simple and fast. The overlap (typically 10-20% of chunk size) ensures context is not cut off at boundaries. This is the right starting point for most use cases.
A practical default: 512 tokens per chunk, 64-token overlap. Adjust based on your document type:
- Dense technical documentation: smaller chunks (256-384 tokens) so each chunk covers one concept
- Narrative text or articles: larger chunks (512-768 tokens) to preserve context across sentences
- Code: split at function or class boundaries rather than by token count
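To make the fixed-size default concrete, here is a minimal sketch of a sliding-window chunker. It assumes whitespace-separated words are a good enough stand-in for model tokens, which is only roughly true; swap in your embedding model's tokenizer if you need exact counts. The function names are illustrative, not from any library.

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def chunk_text(text, size=512, overlap=64):
    # Assumption: words approximate tokens; real token counts will differ.
    words = text.split()
    return [" ".join(window) for window in chunk_tokens(words, size, overlap)]
```

Each chunk starts `size - overlap` tokens after the previous one, so the last `overlap` tokens of one chunk reappear at the start of the next.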
Semantic chunking splits at paragraph or section boundaries rather than arbitrary token counts. It produces more coherent chunks but requires more preprocessing. Worth doing if your documents have clear structure (headers, sections) that fixed-size chunking would cut across.
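One simple way to approximate this is to split on blank lines and pack whole paragraphs into chunks up to a word budget. This is a sketch under the assumption that paragraphs are separated by blank lines and that words approximate tokens; it does not look at headers or embeddings, and `chunk_by_paragraph` is a hypothetical helper name.

```python
import re


def chunk_by_paragraph(text, max_words=512):
    """Pack whole paragraphs into chunks of at most `max_words` words.

    A single paragraph longer than the budget becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_len + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because chunks never cut through a paragraph, sizes vary; very long paragraphs may still need a fixed-size fallback.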
The overlap principle. Always use overlap. A question whose answer spans a chunk boundary will fail without it. 10-15% overlap is usually enough; more than 25% creates redundancy that slows retrieval without improving results.
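The cost of overlap is easy to estimate. For a sliding window where each chunk advances by `size - overlap` tokens, the chunk count for a document of `n_tokens` can be computed directly. The helper below is an illustrative sketch, not part of any library.

```python
def chunk_count(n_tokens, size=512, overlap=64):
    """Number of sliding-window chunks needed to cover n_tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    if n_tokens <= 0:
        return 0
    if n_tokens <= size:
        return 1
    step = size - overlap
    # Ceiling division: windows needed to cover everything past the first overlap.
    return -(-(n_tokens - overlap) // step)
```

For a 10,000-token document with 512-token chunks, this gives 20 chunks with no overlap, 23 with a 64-token overlap, and 26 with a 128-token (25%) overlap, which is where the redundancy starts to add up.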
Metadata matters. Attach metadata to every chunk: source filename, page number, section title if available. You will want this for citations and for debugging retrieval failures.
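A small record type keeps text, id, and metadata together from the start. The sketch below is one possible shape; the `Chunk` class, the id format, and the field names are assumptions for illustration. Optional fields are left out rather than set to None, since some vector stores accept only scalar metadata values.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    id: str
    text: str
    metadata: dict = field(default_factory=dict)


def make_chunks(texts, source, page=None, section=None):
    """Wrap chunk texts with an id and citation metadata."""
    prefix = source if page is None else f"{source}#p{page}"
    chunks = []
    for i, text in enumerate(texts):
        meta = {"source": source, "chunk_index": i}
        if page is not None:
            meta["page"] = page
        if section is not None:
            meta["section"] = section
        chunks.append(Chunk(id=f"{prefix}#{i}", text=text, metadata=meta))
    return chunks


def cite(chunk):
    """Format a short human-readable citation from a chunk's metadata."""
    meta = chunk.metadata
    parts = [meta["source"]]
    if "page" in meta:
        parts.append(f"p. {meta['page']}")
    if "section" in meta:
        parts.append(meta["section"])
    return ", ".join(parts)
```

When retrieval returns something odd, printing `cite(chunk)` next to each result usually shows quickly whether the problem is the split or the search.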
—
Local embedding models
Embedding models convert text into vectors that encode semantic meaning. Similar meaning produces similar vectors, which is what makes retrieval work.
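"Similar vectors" usually means high cosine similarity. A minimal sketch of that measure in plain Python is below; the vectors in the usage note are made-up toy values, not output from a real embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)
```

With toy vectors, `cosine_similarity([1, 0], [1, 0])` is 1.0 and `cosine_similarity([1, 0], [0, 1])` is 0.0. Real embeddings have hundreds of dimensions, but the calculation is the same.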
For local use in 2026, three options cover most needs:
nomic-embed-text (via Ollama) is the easiest starting point. 137M parameters, runs fast even on CPU, and produces 768-dimensional embeddings that work well for most retrieval tasks. The quality is solid for English text and decent for multilingual use. Pull it with ollama pull nomic-embed-text.
mxbai-embed-large (via Ollama or HuggingFace) produces higher-quality embeddings with better performance on technical and domain-specific content. 335M parameters; still CPU-viable but noticeably faster on GPU. Use this when nomic-embed-text retrieval quality is not meeting your needs.
all-MiniLM-L6-v2 (via sentence-transformers) is the lightweight option: 22M parameters, very fast on CPU, reasonable quality for general English text. Good choice if you are running on constrained hardware or need to embed a large corpus quickly.
All three run entirely locally. No API keys, no data leaving your machine.
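Important: use the same embedding model at index time and query time. Each model maps text into its own vector space, and the dimensions often differ (nomic-embed-text produces 768-dimensional vectors, mxbai-embed-large produces 1024-dimensional ones). If you index with nomic-embed-text and query with mxbai-embed-large, the search will either fail outright or return meaningless results.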
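A cheap safeguard is to record the model name and vector dimension next to the index and check them before every query. The sketch below assumes a small JSON file alongside the vector store; the file layout and function names are hypothetical.

```python
import json
from pathlib import Path


def save_index_info(path, model_name, dimension):
    """Record which embedding model built the index."""
    info = {"model": model_name, "dimension": dimension}
    Path(path).write_text(json.dumps(info))


def check_index_info(path, model_name, query_vector):
    """Raise ValueError if the query embedding does not match the index."""
    info = json.loads(Path(path).read_text())
    if info["model"] != model_name:
        raise ValueError(
            f"index was built with {info['model']!r}, query uses {model_name!r}"
        )
    if info["dimension"] != len(query_vector):
        raise ValueError(
            f"index dimension is {info['dimension']}, query vector has {len(query_vector)}"
        )
    return info
```

Call `save_index_info` once after indexing and `check_index_info` at the start of each query; a mismatch then fails loudly instead of quietly returning bad results.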
—
Vector storage for local use
You need somewhere to store and search the vectors. For local RAG, two options are practical:
ChromaDB is the most commonly used local vector store. It installs as a Python package, runs in-process, and needs no server. It stores vectors and metadata together and supports filtering by metadata. Good enough for corpora up to a few hundred thousand chunks.
import chromadb
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("my_docs")
Qdrant (run locally) is faster and more feature-complete if you outgrow ChromaDB. It supports more sophisticated filtering, quantization of stored vectors to reduce memory usage, and handles larger corpora better. It usually runs as a lightweight local server; the Python client also has an in-process local mode that is handy for small experiments.
For most personal or small-team RAG setups, ChromaDB is sufficient. Move to Qdrant when query latency becomes a problem or your collection exceeds a few hundred thousand chunks.
—
The retrieval step
At query time, the process is:
- Embed the user’s query using the same model used for indexing
- Run a similarity search against the vector store
- Return the top-k most similar chunks
- Construct a prompt that includes those chunks as context
- Pass the prompt to your LLM
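The similarity search in the middle of that list is what the vector store does for you. A brute-force version fits in a few lines and is useful for understanding or testing. This sketch assumes all vectors are already normalized to unit length, in which case the dot product equals cosine similarity; the `index` dict and the function names are illustrative.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(query_vec, index, k=4):
    """Return the k best (score, chunk_id) pairs, highest score first.

    `index` maps chunk_id -> unit-length vector (an assumption of this sketch).
    """
    scored = ((dot(query_vec, vec), chunk_id) for chunk_id, vec in index.items())
    return heapq.nlargest(k, scored)
```

A real vector store replaces the full scan with an approximate nearest-neighbor index, but the contract is the same: query vector in, top-k ids and scores out.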
How many chunks to retrieve. Start with k=4 or k=5. More chunks means more context but also more noise and higher token cost. If your answers require synthesizing across many sources, increase k. If your answers are self-contained in a single section, k=3 is often enough.
Re-ranking improves precision. A two-stage approach works well: retrieve 10-20 candidates with the embedding model, then use a cross-encoder re-ranker (such as cross-encoder/ms-marco-MiniLM-L-6-v2 from sentence-transformers) to re-score and select the top 4-5. Re-ranking is slower but consistently improves retrieval quality, especially for ambiguous queries.
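The two-stage shape can be sketched independently of any particular model. Below, `score_fn` is a placeholder for the re-ranker: in a real setup it would likely wrap a cross-encoder (for example the `predict` method of sentence-transformers' `CrossEncoder`, which scores query-passage pairs). The `word_overlap` scorer is a toy stand-in so the example runs without downloading a model; it is not a substitute for a real re-ranker.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Re-score first-stage candidates and keep the top_n.

    candidates: chunk texts returned by the embedding search.
    score_fn(query, text) -> float, where higher means more relevant.
    """
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [text for _, text in scored[:top_n]]


def word_overlap(query, text):
    # Toy scorer: fraction of query words that appear in the text.
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0
```

Usage follows the pattern described above: ask the vector store for 10-20 candidates, then call `rerank(query, candidates, score_fn, top_n=5)` and pass only the survivors to the LLM.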
Hybrid search is worth knowing. Combining vector similarity search with keyword search (BM25) catches cases where exact term matches matter more than semantic similarity. Qdrant supports this natively. ChromaDB requires a separate keyword index.
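If you run the vector search and the keyword search separately, you still need a way to merge the two ranked lists. One common choice is reciprocal rank fusion (RRF), sketched below. Note that RRF is this example's choice of merging method, not something the vector stores above require, and the constant 60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, rrf_k=60):
    """Merge several ranked lists of chunk ids into one.

    Each id scores 1 / (rrf_k + rank) per list it appears in (rank starts at 1);
    ids are returned by descending total score, ties broken by id.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))
```

RRF only uses rank positions, so it sidesteps the problem that BM25 scores and cosine similarities are on different scales.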
—
A minimal working pipeline
Here is the full flow in pseudocode to make it concrete:
# Index time (run once per document set)
# split_into_chunks, embed, and llm are placeholders for your own chunker,
# embedding call, and local LLM client; collection is the ChromaDB
# collection created earlier.
for doc in documents:
    chunks = split_into_chunks(doc, size=512, overlap=64)
    for chunk in chunks:
        vector = embed(chunk.text, model="nomic-embed-text")
        collection.add(ids=[chunk.id], embeddings=[vector],
                       documents=[chunk.text], metadatas=[chunk.metadata])

# Query time (run per user question)
query_vector = embed(user_query, model="nomic-embed-text")
results = collection.query(query_embeddings=[query_vector], n_results=5)
# query() returns one list of documents per query vector; take the first
context = "\n\n".join(results["documents"][0])
prompt = f"""Answer the question using the context below.

Context:
{context}

Question: {user_query}"""
response = llm.generate(prompt)
That is a working RAG system. The sophistication comes from improving each piece: better chunking, a re-ranker, metadata filtering, hybrid search. But the above is not a toy – it works.
—
Common failure modes
Wrong chunks retrieved. Usually a chunking problem, not a model problem. If the answer is split across two chunks with no overlap, the retriever cannot find it. Add overlap and re-index.
Retrieval works, answer is wrong. The LLM is ignoring the retrieved context or hallucinating beyond it. Strengthen the prompt instruction: explicitly tell the model to answer only from the provided context and to say so when the answer is not there.
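A stricter prompt might look like the sketch below. The exact wording is an assumption and worth tuning for your model; numbering the chunks is an optional extra that makes it easier to ask for citations later.

```python
REFUSAL = "I cannot find that in the provided documents."


def build_grounded_prompt(question, chunks):
    """Build a prompt that restricts the model to the retrieved chunks."""
    numbered = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n"
        f'If the context does not contain the answer, reply exactly: "{REFUSAL}"\n\n'
        f"Context:\n{numbered}\n\n"
        f"Question: {question}"
    )
```

A fixed refusal string also gives you something to grep for when measuring how often retrieval comes back empty-handed.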
Slow indexing. Embedding a large corpus on CPU takes time. For a one-time index of thousands of documents, batch the embedding calls and use a GPU if one is available. Incremental indexing (only embedding new or changed documents) keeps ongoing maintenance fast.
Index drift. Documents change but the index does not update. Track document modification times and re-embed changed files. For most personal setups, a weekly re-index is sufficient.
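Tracking modification times takes very little code. The sketch below keeps a JSON manifest of path-to-mtime pairs and reports which files are new, changed, or gone since the last run. The manifest format, the `*.txt` pattern, and the function names are assumptions for illustration; hashing file contents instead of using mtimes would be more robust if your files get touched without changing.

```python
import json
from pathlib import Path


def find_changed(doc_dir, manifest_path, pattern="*.txt"):
    """Compare files under doc_dir against the saved manifest.

    Returns (changed_paths, removed_path_strings, current_manifest).
    """
    manifest_file = Path(manifest_path)
    seen = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    current, changed = {}, []
    for path in sorted(Path(doc_dir).rglob(pattern)):
        mtime = path.stat().st_mtime
        current[str(path)] = mtime
        if seen.get(str(path)) != mtime:
            changed.append(path)  # new file or modified since last index
    removed = sorted(set(seen) - set(current))  # indexed earlier, now missing
    return changed, removed, current


def save_manifest(manifest_path, current):
    """Persist the manifest after the index has been updated."""
    Path(manifest_path).write_text(json.dumps(current))
```

Re-embed the `changed` files, delete chunks belonging to `removed` files from the vector store, and only then call `save_manifest`, so a crash mid-run does not mark unprocessed files as done.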
What are you building RAG for? Curious what document types others are indexing and whether anyone has hit chunking edge cases worth sharing.