
Build a Semantic Search Engine from Scratch with Python

TopicTrick

What Is Semantic Search and How Does It Work?

Semantic search finds documents by meaning rather than keyword overlap. Text is converted into dense vector embeddings using a sentence-transformer model, then stored in a vector database. When a query arrives, it is embedded using the same model and the system returns documents whose vectors are closest in meaning — measured by cosine similarity. Queries and results do not need to share any words.

Keyword search has a fundamental flaw: it matches words, not meaning. Your users do not search for keywords — they describe what they need. A user who types "how to cancel my account" is looking for the same article as one who types "steps to close my subscription." Keyword search misses one of those. Semantic search matches both.

In this project you will build a complete semantic search engine from scratch. By the end you will have a working system that can ingest documents from text files, chunk and embed them, store embeddings in ChromaDB, and expose a clean Python search interface that returns results ranked by meaning — not by keyword overlap.

This is a standalone project. If you want to understand the underlying theory before diving in, read What is a Vector Database? and ChromaDB Tutorial first.


What You Will Build

A five-component semantic search system:

  1. Document loader: reads text files or plain strings into a standard format
  2. Text chunker: splits long documents into overlapping chunks suitable for embedding
  3. Embedding pipeline: converts chunks to vector embeddings using a local model
  4. Vector store: persists embeddings in ChromaDB with metadata
  5. Search interface: accepts natural-language queries and returns ranked results

Prerequisites

Python 3.10 or later. No API keys required — everything runs locally.
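A minimal environment setup might look like the following. The two package names match the libraries used throughout this project (sentence-transformers and chromadb); the virtual-environment step is optional but recommended:

```shell
# Create an isolated environment and install the two libraries the project uses.
python -m venv .venv
source .venv/bin/activate
pip install sentence-transformers chromadb
```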


Step 1: Document Loader

Start with a clean data model. Every document in the system has a source, content, and metadata.

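A sketch of that data model, using only the standard library. The names `Document`, `load_string`, and `load_text_file` are illustrative choices, not fixed by the article:

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """A unit of ingestible content: where it came from, the text, extra metadata."""
    source: str
    content: str
    metadata: dict = field(default_factory=dict)


def load_string(text: str, source: str = "inline", **metadata) -> Document:
    """Wrap a plain string in the standard Document format."""
    return Document(source=source, content=text.strip(), metadata=metadata)


def load_text_file(path: str, **metadata) -> Document:
    """Read a UTF-8 text file into a Document, using the file path as the source."""
    p = Path(path)
    return Document(source=str(p), content=p.read_text(encoding="utf-8").strip(),
                    metadata=metadata)
```

Everything downstream (chunker, embedder, store) consumes `Document` objects, so swapping in a PDF or HTML loader later only requires producing the same shape.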

Step 2: Text Chunker

Long documents can exceed embedding model context limits (typically 256–512 tokens for most sentence-transformers). Chunking splits documents into smaller, overlapping segments. The overlap ensures that sentences split across a chunk boundary are represented in at least one complete chunk.

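One way to implement this, as a self-contained sketch: a sliding character window with overlap, preferring to break at whitespace so words stay whole. The function name and defaults are illustrative:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size characters.

    Each chunk starts `overlap` characters before the previous chunk ended, so
    sentences that straddle a boundary appear whole in at least one chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []

    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Prefer to break at the last whitespace inside the window so words stay whole.
        if end < len(text):
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        # Step back by `overlap` so the boundary region appears in both chunks.
        start = end - overlap if end - overlap > start else end
    return chunks
```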

Choosing Chunk Size

For support articles or documentation (dense, factual content): 400–600 characters. For narrative text or long-form articles: 800–1000 characters. The overlap (100–150 characters) ensures context at boundaries is not lost. Smaller chunks improve retrieval precision; larger chunks give a downstream consumer (such as an LLM in a RAG setup) more context per result.


Step 3: Vector Store Wrapper

Wrap ChromaDB with a clean interface that knows nothing about the chunker or loader:

Step 4: Search Engine — Putting It All Together

Step 5: Run It — Full Working Demo

Run the script and the matches for each query are printed in ranked order with their similarity scores.

Every result is found through meaning, not keywords. The query "will I get charged if I stop subscribing" contains none of the words in the matching document — it finds it through semantic similarity alone.


Step 6: Chunking Long Documents

To see chunking in action, try a longer document:
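A self-contained sketch: a compact version of the Step 2 chunker applied to a longer document, so the chunk boundaries and overlap are visible. The document text is illustrative (the facts about Python's history are accurate):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    # Sliding character window with overlap (same scheme as Step 2).
    if len(text) <= chunk_size:
        return [text]
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text) - overlap, step)]


long_doc = (
    "Python was created by Guido van Rossum and first released in 1991. "
    "It began as a successor to the ABC language, with an emphasis on code readability. "
    "The name comes from the comedy group Monty Python, not the snake. "
    "Python 2.0, released in 2000, introduced list comprehensions and full garbage "
    "collection of reference cycles. "
    "Python 3.0, released in 2008, was a major revision that was not fully backward "
    "compatible with earlier versions. "
    "Today Python is one of the most widely used languages in the world, applied to web "
    "development, data analysis, machine learning, and automation, and it is maintained "
    "by the Python Software Foundation."
)

chunks = chunk_text(long_doc)
print(f"{len(long_doc)} characters -> {len(chunks)} chunks")
# The last 100 characters of chunk 0 are also the first 100 characters of chunk 1.
print(chunks[0][-100:] == chunks[1][:100])
```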

With 500-character chunks and 100-character overlap, the chunker splits a document of this length into two overlapping chunks. Both are indexed. The query "who made Python and when" retrieves the chunk containing the creation details with high similarity.


Step 7: Re-indexing and Document Updates

The engine handles content updates cleanly because it uses content-derived IDs:
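The ID scheme can be demonstrated in isolation with the standard library alone. This sketch assumes the `sha1(source:index)` scheme and the upsert-by-ID behaviour described above; `chunk_ids` is an illustrative helper name:

```python
import hashlib


def chunk_ids(source: str, n_chunks: int) -> list[str]:
    # Same scheme as the engine: ID = sha1("source:index"), stable across runs.
    return [hashlib.sha1(f"{source}:{i}".encode()).hexdigest()
            for i in range(n_chunks)]


# First ingestion of a 3-chunk document.
first = chunk_ids("help/cancel.md", 3)

# Re-ingesting the same source yields identical IDs, so an upsert overwrites
# the old chunks in place instead of duplicating them.
second = chunk_ids("help/cancel.md", 3)
assert first == second

# If the updated document now has fewer chunks, the stale trailing chunks keep
# their old IDs and would linger in the store -- which is why you call
# delete_source() before re-ingesting a changed document.
shorter = chunk_ids("help/cancel.md", 2)
assert shorter == first[:2]
```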

In a production ingestion pipeline, track a document's source identifier and call delete_source() before re-ingesting when the source content changes.


Adding a Simple REST API with FastAPI

To expose your search engine as an HTTP endpoint:

Test it:
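Assuming the service exposes a GET /search endpoint on port 8000 (the default for uvicorn), a request might look like:

```shell
# Query the running service; q is the natural-language query, n the result count.
curl "http://localhost:8000/search?q=how+do+I+cancel+my+account&n=3"
```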

Project File Structure
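One plausible layout, mapping each step to a module (the file names are illustrative, not fixed by the article):

```text
semantic_search/
├── loader.py       # Step 1: document loader
├── chunker.py      # Step 2: text chunker
├── store.py        # Step 3: ChromaDB wrapper
├── engine.py       # Step 4: search engine
├── demo.py         # Step 5: demo script
├── api.py          # FastAPI endpoint
└── chroma_db/      # persisted vectors (created at runtime)
```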

Key Takeaways

• Semantic search finds documents by meaning, not keywords — queries and documents do not need to share words
• Chunking is essential for long documents — split with overlap to avoid losing context at boundaries
• Content-derived IDs (hashed source + index) make your ingestion pipeline safely idempotent
• Setting a min_similarity threshold (0.3–0.4) filters out irrelevant low-confidence results
• The same engine can be extended to support PDF ingestion (add pypdf2), web scraping (add BeautifulSoup), or multiple languages (swap to a multilingual sentence-transformers model)
• Wrapping the engine with FastAPI creates a production-ready semantic search microservice in under 30 lines

What's Next in the Vector Database Series

This post is part of the Vector Database Series. Previous post: ChromaDB vs Pinecone vs pgvector: Which Vector Database Should You Use?.

To add a Claude-powered Q&A layer on top of this search engine, see Claude RAG: Retrieval Augmented Generation. For a comparison of vector database options for scaling this project, see What Is a Vector Database? and ChromaDB Tutorial for Beginners.
