Vector Databases with Chroma in Python: A Practical Guide
- Samul Black

Vector databases have become a core building block in modern AI systems, especially for applications like semantic search, document retrieval, and retrieval-augmented generation. Instead of relying on exact keyword matching, vector databases work with embeddings that capture the meaning and context of text, enabling far more intelligent search and retrieval.
In this guide, we take a fully practical approach to understanding vector databases using Chroma in Python. We will start by scraping textual content from the ColabCodes website, then clean and structure that data, and finally convert it into vector embeddings using a transformer-based embedding model. These embeddings will be stored in a Chroma vector database, allowing us to perform similarity searches over the scraped content.
By the end of this tutorial, you will have a clear understanding of how raw website data can be transformed into a searchable vector store. More importantly, you will see how this pipeline fits into real-world AI workflows such as semantic search systems and LLM-powered applications.

Introduction to Vector Databases with Chroma
Vector databases enable a fundamentally different way of working with data compared to traditional databases. Instead of storing and querying information using exact values or keywords, they operate on numerical representations that capture semantic meaning. This approach has become essential for modern AI applications that deal with unstructured data at scale.
A vector database is a specialized system designed to store, index, and query high-dimensional vectors, commonly known as embeddings. These vectors are generated by machine learning models and represent the semantic properties of data such as text, images, or audio. When working with textual data, embeddings allow content to be transformed into numerical form in a way that preserves meaning and context. As a result, similar pieces of content produce vectors that are closer together in vector space, while unrelated content appears farther apart. Key characteristics of vector databases include:
Semantic representation of data - Text is represented as vectors that encode meaning rather than surface-level words, making semantic similarity measurable.
Efficient similarity search - Vector databases use distance metrics such as cosine similarity or Euclidean distance to quickly find the most relevant vectors for a given query.
Scalability for unstructured data - They are optimized to handle large volumes of embeddings efficiently, even when datasets grow significantly.
Metadata support - Vectors can be stored alongside metadata such as document source, section, or tags, enabling filtered and contextual retrieval.
Because of these capabilities, vector databases are widely used in applications like semantic search engines, recommendation systems, document clustering, and retrieval-based AI pipelines.
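To make distance-based retrieval concrete, here is a minimal, self-contained sketch (an illustration with toy vectors, not part of the tutorial pipeline) that scores similarity using cosine similarity with NumPy:
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: near 1.0 for similar directions, near 0 for unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings"; real models produce hundreds of dimensions
query = np.array([0.9, 0.1, 0.0, 0.2])
doc_a = np.array([0.8, 0.2, 0.1, 0.3])  # conceptually close to the query
doc_b = np.array([0.0, 0.1, 0.9, 0.0])  # unrelated content

print(cosine_similarity(query, doc_a))  # high score -> retrieved first
print(cosine_similarity(query, doc_b))  # low score -> retrieved last
A vector database performs this kind of comparison at scale, using optimized nearest-neighbor indexes instead of brute-force loops over every stored vector.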
Chroma Vector Database
Chroma is an open-source vector database built with a strong emphasis on simplicity and ease of integration, particularly for Python developers. It abstracts away much of the complexity typically associated with vector storage and similarity search, allowing developers to focus on building AI-driven features.
Chroma provides a clean and intuitive interface for managing embeddings, making it suitable for both learning and production use cases. Its design aligns well with modern AI workflows where embeddings are generated dynamically and queried frequently.
Notable features of Chroma include:
Python-first API - Chroma integrates directly into Python applications, making it easy to use in data science and machine learning projects.
Flexible storage options - Embeddings can be stored in memory during experimentation or persisted locally for reuse across sessions.
Document-centric design - Chroma allows embeddings to be stored alongside the original text and metadata, which simplifies retrieval and downstream processing.
Fast similarity queries - Optimized indexing enables efficient nearest-neighbor searches across large embedding collections.
Seamless AI integration - Chroma fits naturally into pipelines involving embedding models, semantic search systems, and large language models.
In this tutorial, Chroma acts as the vector storage layer for embeddings generated from scraped website content. By storing and querying embeddings derived from real-world web data, you gain a practical understanding of how vector databases support intelligent retrieval and search workflows.
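To give a feel for Chroma's Python-first API before building the full pipeline, here is a minimal, self-contained sketch (illustrative names; it uses an in-memory client, and because no embeddings are supplied, Chroma embeds the documents with its default embedding function):
import chromadb

# In-memory client: nothing is persisted, ideal for quick experiments
client = chromadb.Client()
collection = client.create_collection(name="demo")

# Chroma embeds these documents automatically via its default embedding function
collection.add(
    documents=["Chroma stores embeddings with text and metadata.",
               "BeautifulSoup parses HTML documents."],
    metadatas=[{"topic": "vector-db"}, {"topic": "scraping"}],
    ids=["doc1", "doc2"]
)

# Query by meaning: the most semantically similar document is returned first
results = collection.query(query_texts=["How do I search by semantics?"], n_results=1)
print(results["documents"])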
Hands-On Implementation: Vector Databases with Chroma in Python
In this section, we move from concepts to implementation and build a complete vector database pipeline using Python and Chroma. The goal is to take real-world website data and transform it into a searchable vector store that supports semantic retrieval.
We will begin by collecting textual content from the ColabCodes website, followed by cleaning and structuring the extracted data. Next, we will generate embeddings from this text using an embedding model and store those embeddings in a Chroma vector database. Each step is broken down clearly, with code and explanations, so the entire workflow is easy to follow and reproduce.
Step 1: Scraping Textual Data from the Website
The first step in building our vector database pipeline is collecting textual data from the website. Since our goal is to create embeddings from meaningful content, we need to extract clean, readable text rather than raw HTML. For this task, we use BeautifulSoup, a popular Python library for parsing HTML and XML documents. BeautifulSoup allows us to navigate the structure of a web page, locate relevant elements such as headings, paragraphs, and lists, and extract their text content in a structured way.
The typical workflow involves:
Sending an HTTP request to fetch the web page content
Parsing the HTML response using BeautifulSoup
Identifying content-bearing tags like <p>, <h1>–<h6>, and <li>
Removing navigation menus, scripts, styles, and other non-essential elements
Combining the extracted text into a clean, readable format
This cleaned text forms the raw data that will later be converted into embeddings. Careful scraping is important because noisy or poorly structured text directly affects embedding quality and, in turn, the accuracy of semantic search results.
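Condensed into a few lines, the workflow looks roughly like the sketch below (a minimal illustration: the URL is a placeholder, and a real scraper would add error handling, politeness delays, and site-specific cleanup):
import requests
from bs4 import BeautifulSoup

# Placeholder URL: substitute the page you want to scrape
url = "https://www.example.com/some-page"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Drop non-content elements before extracting text
for tag in soup(["script", "style", "nav", "header", "footer"]):
    tag.decompose()

# Collect text from content-bearing tags
parts = [el.get_text(strip=True)
         for el in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6", "p", "li"])]
clean_text = "\n".join(p for p in parts if p)
print(clean_text[:500])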
To keep this guide focused on vector databases and embeddings, we will not dive deeply into the mechanics of web scraping here. A complete, step-by-step tutorial covering BeautifulSoup setup, page traversal, text cleaning, and best practices is available in a dedicated article.
Once the textual content has been scraped and cleaned, we can move on to preparing it for embedding and storage inside the Chroma vector database.
Step 2: Setting Up Dependencies and Configuration
Before generating embeddings and storing them in a vector database, we need to set up the required libraries and define a few configuration parameters. This step ensures that our pipeline is structured, reusable, and easy to modify as the project grows.
We begin by importing the core Python libraries needed for data handling, embedding generation, and vector storage.
For embedding generation, we rely on a pre-trained transformer-based sentence embedding model. This model converts chunks of text into fixed-length numerical vectors that preserve semantic meaning, making them suitable for similarity search.
To store and query these embeddings, we use Chroma, which serves as our vector database backend. Chroma handles vector indexing, persistence, and fast similarity retrieval, allowing us to focus on the application logic rather than database internals.
We also define several configuration variables that control how the pipeline behaves:
Input data source - The JSON file containing scraped website content serves as the starting point for the pipeline.
Vector database storage path - A local directory is used to persist the Chroma database so embeddings can be reused across sessions.
Embedding model selection - A lightweight yet effective sentence transformer model is chosen to balance performance and accuracy.
Chunk size and overlap - Text is split into manageable chunks before embedding. Chunk overlap helps preserve context across adjacent sections of text.
Collection name - A named collection inside Chroma organizes embeddings related to the website content.
Defining these values upfront makes the pipeline easier to tune. For example, chunk size can be adjusted to optimize retrieval quality, or a different embedding model can be swapped in without changing the rest of the code.
# Dependencies
import json
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from typing import List, Dict
from collections import defaultdict
# Configuration
JSON_FILE = "colabcodes_pages.json"
VECTOR_DB_PATH = "./chroma_db"
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
CHUNK_SIZE = 400
CHUNK_OVERLAP = 50
COLLECTION_NAME = "website_content"
Step 3: Chunking Text for Embeddings
Long documents need to be split into smaller pieces before generating embeddings. Chunking improves retrieval accuracy, keeps embedding sizes consistent, and allows relevant sections of a page to be retrieved independently.
The chunk_text function below splits text into overlapping word-based chunks: the window advances by CHUNK_SIZE - CHUNK_OVERLAP words (350 with the configuration above), so consecutive chunks share 50 words and context is preserved across boundaries. Very small chunks are skipped to avoid low-quality embeddings, and each chunk is stored with useful metadata such as the source URL and its position within the document.
By attaching metadata to every chunk, we can later trace search results back to their original page and location. These prepared chunks form the input for the embedding model in the next step.
def chunk_text(text: str, url: str) -> List[tuple]:
    """Split text into overlapping word-based chunks with metadata."""
    words = text.split()
    chunks = []
    # Advance the window by CHUNK_SIZE - CHUNK_OVERLAP words each iteration
    for i in range(0, len(words), CHUNK_SIZE - CHUNK_OVERLAP):
        chunk = " ".join(words[i:i + CHUNK_SIZE])
        # Skip very small chunks, which tend to produce low-quality embeddings
        if len(chunk.strip()) < 50:
            continue
        metadata = {
            "url": url,
            "chunk_index": len(chunks),
            "word_count": len(chunk.split())
        }
        chunks.append((chunk, metadata))
    return chunks
Step 4: Preparing Documents and Initializing the Chroma Collection
In this step, we load the scraped JSON file into a dictionary keyed by URL, then process every page and prepare the data for storage in the vector database. Each page is chunked using the previously defined function and converted into a format that Chroma can work with. For every text chunk, we collect:
The chunk content itself
Associated metadata such as source URL and chunk index
A unique identifier for tracking within the collection
These elements are stored in separate lists, which are later passed to Chroma for insertion. Next, we initialize the embedding model that will convert text chunks into numerical vectors. A lightweight sentence transformer model is used to balance speed and semantic quality.
We then create a persistent Chroma client, allowing the vector database to be saved locally and reused across sessions. To ensure a clean run, any existing collection with the same name is removed before creating a new one.
Finally, a fresh Chroma collection is created to store the website content embeddings. This collection acts as the foundation for semantic search and retrieval in the upcoming steps.
# Load the scraped pages from disk
# (assumed format: a JSON object mapping each URL to its cleaned page text)
with open(JSON_FILE, "r", encoding="utf-8") as f:
    pages_dict = json.load(f)

# Process all pages
documents = []
metadatas = []
ids = []

for url, text in pages_dict.items():
    for chunk, metadata in chunk_text(text, url):
        documents.append(chunk)
        metadatas.append(metadata)
        ids.append(f"doc_{len(documents) - 1}")

# Initialize the embedding model and a persistent, locally stored Chroma client
model = SentenceTransformer(EMBEDDING_MODEL)
client = chromadb.PersistentClient(path=VECTOR_DB_PATH)

# Reset the collection so each run starts clean
try:
    client.delete_collection(name=COLLECTION_NAME)
except Exception:
    pass

collection = client.create_collection(
    name=COLLECTION_NAME,
    metadata={"description": "Semantic search for website content"}
)
Step 5: Generating Embeddings and Storing Them in Chroma
With the text chunks prepared and the Chroma collection initialized, we can now generate embeddings and store them in the vector database. To handle large datasets efficiently, embeddings are created and inserted in batches rather than all at once.
The documents are processed in fixed-size batches to:
Reduce memory usage
Keep embedding generation stable
Avoid overwhelming the vector database during insertion
For each batch, the sentence transformer model converts text chunks into numerical embeddings. These embeddings capture the semantic meaning of the content and are immediately stored in the Chroma collection along with their original text, metadata, and unique identifiers.
BATCH_SIZE = 100

for i in range(0, len(documents), BATCH_SIZE):
    batch_end = min(i + BATCH_SIZE, len(documents))
    batch_docs = documents[i:batch_end]
    batch_ids = ids[i:batch_end]
    batch_metadatas = metadatas[i:batch_end]

    # Generate embeddings for the current batch
    batch_embeddings = model.encode(batch_docs, show_progress_bar=False)

    # Store the batch in ChromaDB
    collection.add(
        documents=batch_docs,
        embeddings=batch_embeddings.tolist(),
        metadatas=batch_metadatas,
        ids=batch_ids
    )
By batching the insertion process, the pipeline remains scalable and efficient, even when working with a large number of website pages. Once this step completes, the vector database is fully populated and ready to support semantic search and retrieval.
After this step, the VECTOR_DB_PATH directory (./chroma_db) contains the persisted database on disk, typically a chroma.sqlite3 file alongside the index data, so the collection can be reloaded in later sessions without re-generating embeddings.

Next Steps: Querying and Using the Vector Database
At this stage, your Chroma vector database is fully populated with embeddings for all website content. Each chunk of text is stored along with metadata, making it easy to retrieve information based on semantic similarity rather than exact keywords.
Two of the most practical applications of your vector database are:
Semantic Search: This allows you to search your website content based on meaning rather than exact words. For example, a user query doesn’t need to match the page text verbatim; the system retrieves results that are conceptually similar. Semantic search is ideal for building intelligent search engines, FAQ assistants, or recommendation systems. Learn more in our detailed guide What Is a Semantic AI Search Engine? A Practical Guide with Python.
Retrieval-Augmented Generation (RAG): RAG combines the power of large language models with your vector database. When a user asks a question, the system first retrieves the most relevant chunks of content from your database, then provides these as context to the language model. This enables accurate, context-aware responses even on domain-specific data. Read more in our detailed guide Building a Context-Aware Conversational RAG Assistant with LangChain in Python.
A typical workflow involves encoding the query using the same embedding model, then using Chroma’s query function to fetch the most relevant chunks along with their metadata. This setup makes it possible to build applications that return precise, semantically meaningful answers from your own content.
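As a minimal sketch of that query flow, assuming the model and collection objects from the steps above (the query text and n_results value are illustrative):
# Encode the query with the SAME model used for the documents
query = "how do vector databases support semantic search?"
query_embedding = model.encode([query]).tolist()

# Retrieve the top matches along with their text, metadata, and distances
results = collection.query(
    query_embeddings=query_embedding,
    n_results=3,
    include=["documents", "metadatas", "distances"]
)

for doc, meta, dist in zip(results["documents"][0],
                           results["metadatas"][0],
                           results["distances"][0]):
    print(f"{meta['url']} (chunk {meta['chunk_index']}, distance {dist:.3f})")
    print(doc[:200], "\n")
In a RAG setup, the retrieved documents would then be concatenated into a context block and passed to a language model together with the user's question.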
Conclusion
In this tutorial, we demonstrated how to build a complete vector database pipeline using Python and Chroma, starting from raw website content. You learned to:
Scrape textual content from a website efficiently using BeautifulSoup.
Split text into overlapping chunks with metadata to preserve context.
Generate embeddings using a transformer-based model.
Store and organize these embeddings in a persistent Chroma vector database.
Prepare the database for semantic search and downstream AI workflows.
This hands-on workflow bridges the gap between raw unstructured data and intelligent retrieval, providing a foundation for building semantic search engines, recommendation systems, or RAG-based applications. By following these steps, you now have a fully functional vector database pipeline that can scale to larger datasets and more complex AI tasks.