Claude RAG: Build a Retrieval-Augmented Generation App

Claude's context window is large — up to 200,000 tokens — but it is not infinite, and it has a training cutoff. Your company's internal documentation, your proprietary research, your customer knowledge base — Claude does not know any of it. You could paste all of it into every prompt, but at scale that is impractical and expensive. Before building a RAG pipeline, it is worth reviewing the Anthropic API getting started guide so you know how to structure requests efficiently.
Retrieval-Augmented Generation, or RAG, is the solution. Instead of loading all your documents into Claude's context on every request, you index them in a vector database. When a question arrives, you retrieve only the most relevant chunks and include those in Claude's prompt. Claude generates an answer grounded in those retrieved passages, not in its general training knowledge.
What is RAG and How Does It Work with Claude?
RAG (Retrieval-Augmented Generation) is a three-stage architecture: split your documents into chunks and store them as vector embeddings in a database; when a question arrives, embed it and retrieve the most semantically similar chunks; pass those chunks to Claude as context and instruct Claude to answer only from the provided information. The result is accurate, grounded answers from private documents, with citations, and far less risk of hallucinated answers drawn from training knowledge.
This project builds a complete, functional RAG system: document ingestion, chunking, embedding, vector search, and grounded answer generation with Claude.
In more detail, the RAG architecture has three phases:
- Indexing: Split documents into chunks, convert each chunk to a vector embedding, and store in a vector database
- Retrieval: When a question arrives, embed the question, find the most semantically similar document chunks using vector similarity search
- Generation: Pass the retrieved chunks to Claude as context and ask Claude to answer the question based on that context
The result: Claude can answer questions accurately from private, current documents without hallucinating facts it does not know. For details on how Claude processes large contexts, see the Anthropic documentation.
Prerequisites
- Python 3.9 or later
- pip install anthropic chromadb sentence-transformers pypdf (pypdf is the maintained successor to the deprecated PyPDF2; use it for PDF ingestion)
- An Anthropic API key set as ANTHROPIC_API_KEY
ChromaDB is an open-source vector database that runs locally with no external service required. sentence-transformers provides the local embedding model.
Complete RAG Implementation
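A minimal end-to-end sketch of the pipeline, assuming documents have already been split into chunk strings. The model names (`all-MiniLM-L6-v2`, `claude-3-5-sonnet-latest`), the grounding system prompt, and the function names are illustrative choices, not fixed requirements. Third-party imports are done lazily inside the functions that need them, so the pure prompt-building helper can be used and tested on its own.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble the user prompt: numbered context passages, then the question."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the context passages below.\n"
        "Cite passage numbers like [1] in your answer. If the context is\n"
        "insufficient, say so instead of guessing.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def index_chunks(chunks: list[str], collection_name: str = "docs"):
    """Embed chunks locally and store them in an in-process ChromaDB collection."""
    import chromadb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
    client = chromadb.Client()
    collection = client.get_or_create_collection(collection_name)
    collection.add(
        ids=[f"chunk-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=model.encode(chunks).tolist(),
    )
    return collection, model


def retrieve(collection, model, question: str, k: int = 4) -> list[str]:
    """Embed the question and return the k most similar stored chunks."""
    result = collection.query(
        query_embeddings=model.encode([question]).tolist(), n_results=k
    )
    return result["documents"][0]


def answer(question: str, chunks: list[str]) -> str:
    """Ask Claude to answer strictly from the retrieved chunks."""
    import anthropic  # reads ANTHROPIC_API_KEY from the environment

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # substitute your preferred Claude model
        max_tokens=1024,
        system="You answer strictly from the provided context and cite passages.",
        messages=[{"role": "user", "content": build_prompt(question, chunks)}],
    )
    return response.content[0].text
```

A full run would look like `collection, model = index_chunks(chunks)` followed by `answer(q, retrieve(collection, model, q))` for each question.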
Choose Chunk Size Based on Your Content
The chunk_size parameter (words per chunk) significantly affects RAG quality. For dense technical documentation, 300-400 words per chunk with 50-word overlap works well. For narrative text or long-form reports, 600-800 words per chunk may be more appropriate to preserve context. Too small, and chunks lack sufficient context for Claude to give complete answers. Too large, and retrieval precision drops because chunks contain too many topics.
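One way to implement the word-based chunking described above is a sliding window. This is a dependency-free sketch; the function name and the defaults (350 words, 50-word overlap, matching the technical-documentation guidance above) are illustrative.

```python
def chunk_text(text: str, chunk_size: int = 350, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks. Consecutive chunks share `overlap`
    words, so a sentence straddling a boundary appears intact in at least one
    chunk rather than being cut in half."""
    words = text.split()
    if len(words) <= chunk_size:
        return [" ".join(words)] if words else []
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final window already reached the end of the text
    return chunks
```

For narrative text, call it with larger values, e.g. `chunk_text(report, chunk_size=700, overlap=80)`.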
Extending to Production
- Replace ChromaDB with a managed vector database like Pinecone, Weaviate, or pgvector for production scale and persistence
- Replace sentence-transformers with a higher-quality hosted embedding model — Anthropic does not offer its own embeddings API, and its documentation points to third-party providers such as Voyage AI
- Add re-ranking: After vector retrieval, use a cross-encoder model to re-rank chunks by relevance before passing to Claude — improves answer quality significantly
- Implement hybrid search: Combine vector similarity search with keyword BM25 search — hybrid search consistently outperforms either approach alone
- Add document versioning: Track document versions and re-ingest when documents are updated, removing old chunks and adding new ones
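The hybrid-search bullet above is often implemented with reciprocal rank fusion (RRF), which merges ranked lists without needing comparable scores. A dependency-free sketch: the two input lists would come from vector search and BM25 respectively, and `k=60` is the constant commonly used in the RRF literature.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one combined ranking.
    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked highly by either retriever rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, fusing a vector ranking `["a", "b", "c"]` with a BM25 ranking `["b", "d", "a"]` puts `b` first, since both retrievers rank it highly.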
Summary
RAG is the most important architectural pattern for giving Claude accurate knowledge from private or current information. The three-stage pipeline of indexing, retrieval, and generation is straightforward to implement and scales from a personal knowledge base to enterprise document search.
- Chunk with overlap to preserve context at boundaries
- Use a local embedding model for cost-effective indexing — Claude is only needed for the generation step, not for computing embeddings
- Constrain Claude via the system prompt to answer only from provided context — prevents hallucination
- Cite sources in every answer — users need to know where information came from
Next IT pro project: Build an AI-Powered IT Incident Report Generator.
For the ChromaDB fundamentals used in this RAG pipeline, the ChromaDB beginner tutorial covers collections, metadata filtering, and HNSW tuning in detail. When you are ready to scale beyond ChromaDB, the vector database comparison guide helps you choose between Pinecone, ChromaDB, and pgvector for production.
The ChromaDB documentation covers production deployment options including the HTTP server mode and Docker containers. For embedding model selection, the Sentence Transformers pretrained models list is the best reference for finding a model suited to your domain and language.
This post is part of the Anthropic AI Tutorial Series. Previous post: Project: Build a Multi-Language Translator App with Claude.