Vector Database Optimisation: Chunking & Scaling

Vector Database Production Optimisation: The Key Techniques
The most impactful optimisations for a production vector database, in rough order of impact:
- Semantic chunking strategy (the single biggest factor in retrieval quality)
- Embedding model selection matched to your query type
- HNSW index tuning (especially search_ef, which defaults to 10 in ChromaDB but should be 50–100)
- Two-stage retrieval with cross-encoder re-ranking
- Hybrid search combining vector and keyword
- Redis query caching
- Monitoring for zero-result and low-similarity rates
A vector database that works well in a demo often behaves completely differently in production. You added 1,000 chunks during development. In production you have 500,000 — and queries that took 20 ms now take 800 ms. Your RAG pipeline was returning great answers in testing. In production, users complain the answers feel generic. You think it is the LLM. It is actually the retrieval.
This post covers the techniques that separate a demo-grade vector search system from a production-grade one: advanced chunking strategies, embedding model selection, HNSW index tuning, hybrid search, query caching, monitoring, and scaling patterns.
This is the advanced post in the series. You should already be comfortable with the basics from the earlier posts: What is a Vector Database?, ChromaDB Tutorial, and Build a Semantic Search Engine from Scratch.
1. Chunking Strategy — The Most Impactful Decision
The single biggest factor in retrieval quality is not your vector database or your embedding model — it is how you chunk your documents. Poor chunking causes your system to retrieve irrelevant or incomplete context regardless of how good everything else is.
Fixed-Size Chunking (Baseline)
Split by character count with overlap. Simple, predictable, and good enough for many use cases.
Problem: A 500-character boundary might land in the middle of a sentence or split a code example in half. The resulting chunk loses semantic coherence.
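A minimal sketch of the baseline (chunk size and overlap values are illustrative, not recommendations):

```python
def chunk_fixed(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step back by `overlap` chars each time
    return chunks
```

The overlap softens the boundary problem (a sentence cut at one chunk edge is usually intact in the next chunk) but does not solve it.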
Sentence-Aware Chunking (Better)
Use NLTK or spaCy to split at sentence boundaries, then group sentences into target-sized chunks:
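A sketch of the grouping step. For brevity a regex split stands in for nltk.sent_tokenize or spaCy's sentencizer; the grouping logic is the part that matters:

```python
import re

def chunk_by_sentence(text: str, target: int = 500) -> list[str]:
    """Group whole sentences into chunks of roughly `target` characters."""
    # Stand-in for nltk.sent_tokenize: split after ., ! or ? + whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > target:
            chunks.append(current)   # close the chunk at a sentence boundary
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks
```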
This respects sentence boundaries, dramatically improving chunk coherence.
Semantic Chunking (Best for Long Documents)
Split based on topic shifts — when the semantic similarity between consecutive sentences drops below a threshold, start a new chunk. This keeps topically coherent content together.
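A minimal sketch of the idea with a pluggable embedding function. In practice embed_fn would be something like a sentence-transformers model's encode; the 0.5 threshold is illustrative and should be tuned on your own documents:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def chunk_semantic(sentences, embed_fn, threshold=0.5):
    """Start a new chunk whenever the similarity between consecutive
    sentence embeddings drops below `threshold`."""
    if not sentences:
        return []
    vectors = [embed_fn(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append(" ".join(current))  # topic shift: close the chunk
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

Note that this embeds every sentence once during ingestion, which is where the extra cost mentioned below comes from.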
Which Chunking Strategy to Use
Fixed-size: fastest, adequate for homogeneous content. Sentence-aware: best balance of quality and speed for most RAG applications. Semantic chunking: highest quality for long mixed-topic documents (e.g., legal contracts, research papers) — but the extra embedding pass doubles ingestion cost.
2. Embedding Model Selection
Not all embedding models produce equally useful vectors for your task. The wrong model can degrade retrieval quality by 20–40%.
Matching Model to Task
| Task | Recommended Model | Dimensions |
|---|---|---|
| General semantic search | all-MiniLM-L6-v2 | 384 |
| High accuracy semantic search | all-mpnet-base-v2 | 768 |
| Question → document retrieval | multi-qa-MiniLM-L6-cos-v1 | 384 |
| Code search | krlvi/sentence-msmarco-bert-base-dot-v5 | 768 |
| Multilingual | paraphrase-multilingual-MiniLM-L12-v2 | 384 |
| Highest quality (paid) | text-embedding-3-large (OpenAI) | 3072 |
For RAG applications — where you embed queries and retrieve document chunks — use a model designed for asymmetric retrieval (short question → long document). multi-qa-MiniLM-L6-cos-v1 and multi-qa-mpnet-base-dot-v1 are specifically trained for this.
Benchmarking Your Embedding Model
Never assume a model will work well for your domain. Build a small evaluation set and measure retrieval quality:
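One simple metric is recall@k: the fraction of hand-labelled queries whose expected document appears in the top-k results. A sketch, where retrieve_fn is whatever wraps your search call (for ChromaDB, a function around collection.query) and eval_set is a list of (query, expected document id) pairs you label yourself:

```python
def recall_at_k(eval_set, retrieve_fn, k: int = 5) -> float:
    """Fraction of queries whose expected doc id appears in the top-k ids."""
    hits = 0
    for query, expected_id in eval_set:
        if expected_id in retrieve_fn(query, k):
            hits += 1
    return hits / len(eval_set)
```

Run the same eval_set against each candidate embedding model and keep the one with the best score.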
Building even 20–30 test cases and running this evaluation before committing to a model can save hours of debugging poor retrieval later.
3. HNSW Index Tuning
HNSW (Hierarchical Navigable Small World) is the index algorithm used by ChromaDB, Qdrant, and Weaviate. Three parameters control the quality/speed trade-off:
M — number of bi-directional links per node. Higher M improves recall but increases memory and indexing time. Default: 16. For high-recall applications: 32–64.
ef_construction — the size of the dynamic candidate list during index build. Higher values improve recall but slow down ingestion. Default: 100. For high-recall: 200–400.
ef_search (or hnsw:search_ef in ChromaDB) — the candidate list size at query time. Higher values improve recall but slow queries. This is the most useful runtime trade-off parameter.
ChromaDB Default search_ef is 10
ChromaDB's default hnsw:search_ef is 10, which is very conservative. For production RAG applications, set it to at least 50–100. A search_ef of 10 on a large collection can miss highly relevant results and is the most common cause of poor retrieval quality in ChromaDB deployments.
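In ChromaDB the HNSW parameters are passed as collection metadata at creation time. A configuration sketch (collection name and exact values are illustrative; the client calls, which assume the chromadb package, are shown as comments):

```python
# Tuned HNSW settings for a production RAG collection.
hnsw_config = {
    "hnsw:space": "cosine",
    "hnsw:M": 32,                 # default 16
    "hnsw:construction_ef": 200,  # default 100
    "hnsw:search_ef": 100,        # default 10 -- the critical one
}
# import chromadb
# client = chromadb.PersistentClient(path="./chroma")
# collection = client.create_collection("docs", metadata=hnsw_config)
```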
pgvector HNSW tuning:
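The equivalent knobs in pgvector are index storage parameters plus a session setting. A sketch, assuming a documents table with an embedding vector column:

```sql
-- Build the HNSW index with higher-recall settings
-- (pgvector defaults: m = 16, ef_construction = 64).
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 200);

-- Query-time candidate list (default hnsw.ef_search is 40).
SET hnsw.ef_search = 100;
```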
4. Hybrid Search: Combining Vector and Keyword
Pure vector search is excellent for semantic queries but can miss exact matches that keyword search would catch trivially (product codes, technical identifiers, proper nouns). Hybrid search combines both.
Hybrid Search with pgvector + Full-Text Search
Postgres has native full-text search (tsvector). Combine it with pgvector in a single query using Reciprocal Rank Fusion (RRF):
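A sketch of the combined query, assuming a documents table with an embedding vector column and a tsv tsvector column, with the query embedding as $1 and the query text as $2:

```sql
WITH vector_hits AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rank
    FROM documents
    ORDER BY embedding <=> $1
    LIMIT 20
),
keyword_hits AS (
    SELECT id, ROW_NUMBER() OVER
           (ORDER BY ts_rank(tsv, plainto_tsquery('english', $2)) DESC) AS rank
    FROM documents
    WHERE tsv @@ plainto_tsquery('english', $2)
    LIMIT 20
)
SELECT id,
       COALESCE(1.0 / (60 + v.rank), 0) +
       COALESCE(1.0 / (60 + k.rank), 0) AS rrf_score
FROM vector_hits v
FULL OUTER JOIN keyword_hits k USING (id)
ORDER BY rrf_score DESC
LIMIT 5;
```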
Reciprocal Rank Fusion (the 1.0 / (60 + rank) formula) normalises scores from different systems into a comparable range and combines them without needing to calibrate weights. The constant 60 dampens the impact of very high ranks.
Hybrid Search with ChromaDB (Manual)
ChromaDB does not have built-in keyword search, but you can implement hybrid search manually:
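One way to sketch it: get vector ranks from collection.query, lexical ranks from a keyword scorer (for example the rank_bm25 package), then fuse the two ranked id lists with RRF. The fusion step, which is the part ChromaDB does not provide, is pure Python:

```python
def rrf_fuse(*ranked_id_lists, k: int = 60) -> list[str]:
    """Fuse ranked result lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in ranked_id_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# vector_ids = collection.query(query_texts=[q], n_results=20)["ids"][0]
# keyword_ids = [...]  # top-20 ids from your BM25 / keyword scorer
# fused = rrf_fuse(vector_ids, keyword_ids)
```

Documents that appear in both rankings accumulate score from each, so they float to the top without any weight calibration.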
5. Query Caching
Identical or near-identical queries are common in production (users asking the same FAQ questions repeatedly). Cache results to avoid redundant embedding and retrieval operations.
For production at scale, replace the in-memory dict with Redis:
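A sketch of the cache layer. The in-memory version below is fully working; the Redis swap amounts to replacing the dict with redis-py's get / setex (noted in the docstring), which gives you cross-process sharing and automatic TTL eviction:

```python
import hashlib
import time

class QueryCache:
    """TTL cache keyed on the normalised query string.

    For Redis, replace _store with e.g. r = redis.Redis();
    r.get(key) / r.setex(key, ttl, serialised_results).
    """

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def _key(self, query: str) -> str:
        # Normalise so "What is HNSW?" and "what is hnsw?" share an entry.
        return hashlib.sha256(query.lower().strip().encode()).hexdigest()

    def get(self, query: str):
        entry = self._store.get(self._key(query))
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]
        return None

    def set(self, query: str, results) -> None:
        self._store[self._key(query)] = (time.monotonic(), results)
```

Check the cache before embedding the query: a hit skips both the embedding call and the index search.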
6. Re-ranking Results
Vector search returns the top-K most similar chunks, but similarity to the query vector is not always the best measure of relevance. A re-ranker (cross-encoder model) takes the query and each candidate chunk as a pair and produces a more accurate relevance score.
Re-ranking adds 50–200 ms of latency but can dramatically improve retrieval quality, especially for complex queries. It is the technique used by Cohere Rerank, Pinecone's rerank API, and Voyage AI.
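The two-stage shape can be sketched with a pluggable scorer. In practice score_fn would be a cross-encoder, for example sentence-transformers' CrossEncoder with a model such as "cross-encoder/ms-marco-MiniLM-L-6-v2" (a common choice, named here only as an illustration), scoring each (query, chunk) pair:

```python
def retrieve_and_rerank(query, candidates, score_fn, top_k=5):
    """Stage 2 of two-stage retrieval: re-score over-retrieved candidates
    with a (query, chunk) pair scorer and keep the best top_k."""
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:top_k]
```

Over-retrieve generously in stage 1 (e.g. top-20 from the ANN index) so the re-ranker has enough candidates to find the genuinely relevant ones.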
7. Monitoring and Observability
You cannot optimise what you cannot measure. Track these metrics in production:
Key metrics to alert on:
- p95 query latency > 500 ms: signals HNSW tuning or hardware is needed
- Zero-result rate > 5%: suggests gaps in your document corpus
- Low similarity rate > 20%: suggests chunking, embedding model, or corpus coverage issues
- Cache hit rate < 10%: high query diversity — caching may not help much, focus on index tuning
8. Scaling Patterns
When ChromaDB Starts Slowing Down
ChromaDB runs on a single machine. When queries start taking more than 200 ms at your target collection size, consider:
- Tune HNSW first — increase hnsw:M and hnsw:construction_ef before re-indexing
- Add a Redis cache for repeated queries
- Partition by metadata — split one large collection into multiple smaller ones by category or date range, and route queries to the appropriate partition
- Migrate to Qdrant or pgvector when the single-machine ceiling is reached
Migrating from ChromaDB to Qdrant
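The migration is an export-transform-upsert loop: pull everything out of the ChromaDB collection with collection.get, convert it into Qdrant's point format, and upsert in batches. A sketch of the transform step (the commented client calls assume the chromadb and qdrant-client packages; the vector size of 384 matches all-MiniLM-L6-v2):

```python
def chroma_export_to_points(export: dict) -> list[dict]:
    """Convert the dict returned by ChromaDB's collection.get(include=[...])
    into Qdrant-style point dicts (integer ids, vector, payload)."""
    return [
        {"id": i,
         "vector": vec,
         "payload": {"chroma_id": cid, "document": doc, **(meta or {})}}
        for i, (cid, vec, doc, meta) in enumerate(
            zip(export["ids"], export["embeddings"],
                export["documents"], export["metadatas"]))
    ]

# export = chroma_collection.get(include=["embeddings", "documents", "metadatas"])
# points = chroma_export_to_points(export)
# from qdrant_client.models import PointStruct, VectorParams, Distance
# qdrant.create_collection("docs",
#     vectors_config=VectorParams(size=384, distance=Distance.COSINE))
# qdrant.upsert("docs", points=[PointStruct(**p) for p in points])
```

Keeping the original ChromaDB id in the payload lets you verify the migration by spot-checking the same queries against both systems before cutting over.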
Production Optimisation Checklist
- ☐ Use sentence-aware or semantic chunking instead of naive fixed-size splits
- ☐ Benchmark your embedding model on a domain-specific evaluation set before committing
- ☐ Set hnsw:search_ef to at least 50–100 in ChromaDB (default of 10 is too low)
- ☐ Over-retrieve (top-20) then re-rank (select top-5) for high-quality RAG applications
- ☐ Add Redis-backed query caching for common/repeated queries
- ☐ Implement hybrid search (vector + BM25/FTS) if your corpus contains exact-match-critical content
- ☐ Monitor p95 query latency, zero-result rate, and low-similarity rate
- ☐ Review low-similarity queries weekly to identify corpus gaps
- ☐ Plan migration to Qdrant or Pinecone before you hit ChromaDB's single-machine ceiling
Key Takeaways
- Chunking strategy is the highest-impact optimisation — sentence-aware or semantic chunking consistently outperforms fixed-size splits
- Embedding model selection matters — use asymmetric retrieval models for question-to-document RAG
- ChromaDB's default search_ef = 10 is the most common source of poor recall in production — set it to 50–100
- Two-stage retrieval (ANN + cross-encoder re-ranking) gives the best quality at acceptable latency
- Hybrid search (vectors + keyword) handles both semantic and exact-match queries correctly
- Monitor the metrics that matter: zero-result rate and low-similarity rate tell you about corpus quality; p95 latency tells you about infrastructure
Vector Database Series — Complete
You have now completed the full Vector Database Series:
- What is a Vector Database? The Complete Beginner's Guide
- ChromaDB Tutorial: The Complete Beginner's Guide
- ChromaDB vs Pinecone vs pgvector: Which Should You Use?
- Build a Semantic Search Engine from Scratch
- Vector Database Optimisation for Production ← you are here
For the next natural step — connect your vector database to an LLM and build a complete RAG pipeline — see Project: Build a RAG App with Claude.
For primary documentation on the tools used in this post: ChromaDB HNSW configuration reference, Pinecone performance optimisation guide, and the pgvector HNSW indexing documentation. Related posts in this series: What is a Vector Database? and ChromaDB vs Pinecone vs pgvector.
