Building a Context-Aware Conversational RAG Assistant with LangChain in Python
- Samul Black

- Oct 12
- 12 min read
A Step-by-Step Guide to Integrating Google’s GenAI SDK with LangChain for Context-Aware Conversations
Retrieval-Augmented Generation (RAG) is a key technique that enables language models to retrieve knowledge from external data sources before generating answers. In this post, we’ll walk through how to build a Conversational RAG Assistant using LangChain, Google Gemini (via the official SDK), and Chroma for vector storage.
You’ll learn how to:
Load your Gemini API key securely.
Integrate Gemini models with LangChain.
Use website data for retrieval.
Make the assistant remember context across turns.
Let’s dive in.

1. Environment Setup and API Initialization
Before building the Conversational RAG assistant, the environment must be properly configured and connected to Google’s Gemini API. This step involves securely loading the Gemini API key, initializing the official Google GenAI client, and preparing LangChain-compatible components for later use in the pipeline.
Importing the Required Modules
The script begins by importing several modules that handle environment management, API connectivity, and model initialization.
from dotenv import load_dotenv
import os
import google.genai as genai # Official Google GenAI SDK
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
Here is what each import does:
dotenv: Allows secure loading of sensitive configuration values (like API keys) from a .env file. This prevents credentials from being hardcoded into your source code.
os: Provides operating system utilities, used here to access environment variables after loading them.
google.genai: The official Google Generative AI Python SDK, which provides a direct interface to Gemini models. This SDK is developed and maintained by Google, ensuring compatibility with the latest Gemini releases.
langchain_google_genai: A LangChain integration package that allows Google’s Gemini models to be used within LangChain workflows, such as retrieval-augmented generation (RAG) and conversational chains.
This setup ensures that both the official SDK and LangChain’s higher-level abstractions can be used seamlessly in the same environment.
Creating and Configuring the .env File
To authenticate with Google’s Generative AI API, you must store your API key securely in an environment file rather than embedding it directly in your Python code. In the same directory as your script, create a new file named .env and include the following line:
GEMINI_API_KEY="YOUR_ACTUAL_API_KEY"This file should not be shared publicly or included in version control systems such as GitHub. The python-dotenv library will automatically load this file at runtime and make the API key available as an environment variable inside your Python session.
Loading and Validating the API Key
Once the .env file is created, the script loads it and retrieves the API key. This ensures that the Gemini services can be accessed securely.
load_dotenv()
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
To confirm that the key has been loaded successfully and meets a minimum length requirement, the script includes a validation check and a debug print statement:
print(f"DEBUG: GEMINI_API_KEY loaded status: {'Key Loaded' if GEMINI_API_KEY and len(GEMINI_API_KEY) > 5 else 'Key Missing or Short'}")
if not GEMINI_API_KEY or len(GEMINI_API_KEY) < 20:
    raise ValueError(
        "GEMINI_API_KEY environment variable not set or too short. "
        "Please ensure your .env file exists and contains a valid key: GEMINI_API_KEY='YourActualKeyValue'."
    )
This verification step helps you identify configuration problems early, such as missing environment variables, truncated keys, or typos in the .env file. If the key is missing or invalid, the script stops execution and provides a clear error message to guide you toward the solution.
Initializing the Gemini API Client and LangChain Models
After verifying the API key, the next step is to initialize both the official Google GenAI client and LangChain-compatible models. This creates a direct connection to Gemini services and prepares components for embeddings and conversational processing.
genai_client = genai.Client(api_key=GEMINI_API_KEY)
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", google_api_key=GEMINI_API_KEY)
embeddings = GoogleGenerativeAIEmbeddings(model="text-embedding-004", google_api_key=GEMINI_API_KEY)
Each line serves a specific purpose:
genai.Client(api_key=GEMINI_API_KEY) initializes a client using the official Google SDK. This provides access to Gemini’s native features, enabling low-level operations such as text generation, model listing, and configuration control.
ChatGoogleGenerativeAI creates a LangChain-compatible chat model instance. This layer integrates Gemini’s conversational capabilities into LangChain’s standardized API for prompts, message passing, and output parsing.
GoogleGenerativeAIEmbeddings initializes a Gemini embedding model that converts text into numerical vectors. These embeddings are later used for semantic search, similarity matching, and retrieval tasks within the RAG system.
By explicitly passing the API key to both the official client and LangChain wrappers, the script ensures consistent authentication and model access across both frameworks.
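If you want to confirm that the connection works before going further, you can exercise the official client directly. The following is a minimal sketch, assuming the client and key from the code above; the exact model names returned will vary with your account and SDK version.
# Optional sanity check using the official SDK client.
for model in genai_client.models.list():
    print(model.name)  # available model identifiers, e.g. "models/gemini-2.5-flash"

response = genai_client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Reply with a one-sentence greeting."
)
print(response.text)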
Importance of This Setup
This initialization step is critical because it establishes a unified environment where both Google’s official SDK and LangChain’s RAG ecosystem can operate together.
The Google SDK provides robust, low-level access to the Gemini models, ensuring direct control and support for the latest API features.
LangChain’s integration provides higher-level abstractions such as prompt templates, document loaders, and retrieval chains, making it easier to build scalable applications that combine structured retrieval and generative reasoning.
By combining these two components, developers can take advantage of Gemini’s performance and LangChain’s modular design, enabling the creation of advanced conversational systems that can understand, retrieve, and respond using real-world knowledge sources.
2. Loading Website Data into a Vector Store
After initializing the environment and Gemini API, the next step is to enable your assistant to retrieve and use external knowledge from your website or any other online data source. This is achieved by loading content, transforming it into semantic embeddings, and storing it inside a vector database for efficient retrieval.
This process establishes the foundation of Retrieval-Augmented Generation (RAG) — a framework that lets your language model access factual, domain-specific information beyond its training data.
Importing the Required LangChain Components
To load, preprocess, and store website data, we import several essential LangChain modules:
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
Here’s what each of these modules does:
WebBaseLoader: A document loader that fetches raw HTML content directly from web pages and converts it into a text-based document format compatible with LangChain’s processing tools.
Chroma: A lightweight, high-performance vector database used to store and query document embeddings. It supports similarity search, which is essential for retrieving contextually relevant information.
RecursiveCharacterTextSplitter: A utility that breaks long text documents into smaller, manageable chunks while preserving sentence and paragraph coherence. This ensures that embeddings remain semantically meaningful.
ChatPromptTemplate and MessagesPlaceholder: Though not used directly in this step, these modules are part of the conversational layer. They allow flexible prompt construction and memory management when integrating the retriever into a chat-based system.
Defining the initialize_vectorstore() Function
The following function automates the process of loading, chunking, embedding, and storing web content into a vector store.
def initialize_vectorstore(urls: list[str]):
    loader = WebBaseLoader(urls)
    docs = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    splits = text_splitter.split_documents(docs)
    vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)
    return vectorstore.as_retriever()
Let’s break down each step in detail:
Loading Website Data
loader = WebBaseLoader(urls)
docs = loader.load()
The WebBaseLoader fetches and processes the content from each URL provided in the list. It automatically removes unnecessary HTML tags and structures the text into document objects. This converts your live website pages into analyzable text documents that can be embedded later.
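If you want to see what the loader actually returned, you can inspect the documents before splitting them. This is a small debugging sketch, assuming you have run the two lines above interactively (for example in a notebook) rather than only inside the function:
# Inspect the loaded documents (debugging only).
print(len(docs))                        # one document per URL provided
print(docs[0].metadata.get("source"))   # the URL the first document came from
print(docs[0].page_content[:300])       # first 300 characters of extracted text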
Splitting Long Documents into Chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
Large web pages or blog articles can contain thousands of characters, which exceed the token limits of most language models. The RecursiveCharacterTextSplitter breaks them into overlapping text segments — each about 1,000 characters long, with a 200-character overlap. This overlap preserves context continuity between chunks, allowing the retriever to return coherent responses even if the relevant information spans multiple sections.
Creating a Vector Store with Embeddings
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)
The text chunks are converted into high-dimensional vectors (embeddings) using the Gemini embedding model initialized earlier. These vectors capture semantic meaning — similar pieces of text will have nearby vector representations. The Chroma database then stores these embeddings, enabling fast similarity-based lookups later during query time.
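To see the similarity search itself, you could query the vector store directly before converting it into a retriever. This is an optional sketch, assuming you have access to the vectorstore object created above:
# Optional: query the vector store directly to inspect the nearest chunks.
results = vectorstore.similarity_search("What services does the company offer?", k=3)
for doc in results:
    print(doc.metadata.get("source"), "->", doc.page_content[:120])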
Returning a Retriever Object
return vectorstore.as_retriever()
Instead of returning the entire database object, this function provides a retriever interface, which simplifies querying. The retriever can later be plugged into LangChain’s conversational chain to fetch relevant text segments dynamically during a chat.
Initializing the Retriever with Website URLs
With the function defined, you can now specify the URLs you want your assistant to access. These could include your homepage, service pages, or documentation.
WEBSITE_URLS = [
    "https://www.colabcodes.com/",
    "https://www.colabcodes.com/full-stack-software-development-and-ai-services"
]
retriever = initialize_vectorstore(WEBSITE_URLS)
This command loads the listed web pages, splits them into text segments, embeds their content, and creates a retriever that can efficiently search across them.
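Before plugging the retriever into a conversational chain, a quick smoke test helps confirm that relevant chunks come back. A minimal sketch, assuming the retriever created above:
# Smoke test: fetch chunks related to a sample query.
sample_docs = retriever.invoke("What services does ColabCodes offer?")
for doc in sample_docs:
    print(doc.metadata.get("source"))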
This process forms the knowledge retrieval backbone of a RAG pipeline. While large language models like Gemini possess broad general knowledge, they do not inherently know about your specific business, website, or proprietary data.
By embedding your website’s content into a vector store, you give your AI assistant real-time access to up-to-date, domain-specific information, ensuring its responses are accurate, contextually relevant, and grounded in your own data.
This architecture allows your chatbot to:
Retrieve factual information from your website instead of relying on model memory alone.
Stay synchronized with recent updates on your site without retraining the model.
Provide users with responses that reflect your actual services, content, and terminology.
In essence, this step transforms a generic AI model into a context-aware assistant tailored to your organization’s knowledge base.
3. Making the RAG Assistant Context-Aware
At this stage, the RAG system can already retrieve relevant data from your website. However, to make the assistant truly conversational, it must also remember previous exchanges and interpret follow-up questions in context — just like a human.
This is achieved through history-aware retrievers and smart prompt templates, which help the model maintain continuity throughout the dialogue.
A standard chatbot can only answer isolated questions, but a Conversational RAG system reformulates user queries dynamically based on prior conversation history.
Step 3.1: Reformulating Follow-Up Queries
Users rarely repeat full questions in a conversation. Instead, they ask follow-ups like:
“What about its pricing?”
“How does that apply to startups?”
“What does the ‘Security First’ value mean?”
To handle such queries effectively, the model must rewrite follow-up questions into self-contained ones that make sense even without the chat history.
We begin by defining a system prompt that guides the language model on how to perform this contextual reformulation.
contextualize_q_system_prompt = (
"Given a chat history and the latest user question which might reference context, "
"formulate a standalone question that can be understood without chat history."
)This prompt ensures that if the user says something like “What does the first one mean?”, the assistant understands it refers to the company’s first value (for example, “Security First”) from the previous turn.
Next, a demonstration function shows this mechanism in action:
from langchain_core.messages import HumanMessage, AIMessage
from langchain_core.output_parsers import StrOutputParser

def demonstrate_query_contextualization():
    # Simulated prior turns between the user and the assistant.
    demo_chat_history = [
        HumanMessage(content="What are the core values of the company?"),
        AIMessage(content="The core values include 'Security First' and 'Customer Trust'."),
    ]
    follow_up_question = "What does the 'Security First' one entail?"
    # Prompt combining the system instruction, prior messages, and the new input.
    contextualization_prompt = ChatPromptTemplate.from_messages([
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ])
    # Chain: prompt -> Gemini model -> plain string output.
    contextualization_chain = contextualization_prompt | llm | StrOutputParser()
    reformulated_query = contextualization_chain.invoke({
        "input": follow_up_question,
        "chat_history": demo_chat_history
    })
    print(f"Reformulated Query: {reformulated_query}")
How this works:
Chat History Simulation – The variable demo_chat_history represents previous conversation turns between the human and the assistant.
Follow-Up Question – The variable follow_up_question is a short, context-dependent question that would otherwise confuse a stateless model.
Prompt Construction – ChatPromptTemplate combines system instructions, past messages (MessagesPlaceholder), and the user’s latest input.
Chaining the Components – The prompt is piped through the LLM and a simple string parser using LangChain’s | operator, forming a mini “chain” that processes input and outputs text.
Reformulation Output – The chain produces a new, context-independent version of the question, making it easier for downstream components (like retrievers) to search relevant documents.
This step allows the assistant to interpret user intent even when questions are abbreviated or dependent on prior messages. It ensures the flow of the conversation feels natural and coherent.
Step 3.2: Creating a History-Aware Retriever
After establishing the logic for query reformulation, the next step is to embed that intelligence directly into the retriever. LangChain provides a built-in helper called create_history_aware_retriever, which automatically rewrites context-dependent questions before performing a vector search.
from langchain.chains import create_history_aware_retriever
history_aware_retriever = create_history_aware_retriever(
    llm,
    retriever,
    ChatPromptTemplate.from_messages([
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ])
)
How this works:
The LLM is used to reformulate queries just as in the previous demonstration.
The retriever (initialized earlier with website data) fetches relevant document chunks based on the rewritten query.
The ChatPromptTemplate ensures that each query is first reinterpreted in context before being passed into the retrieval step.
The result is a context-aware retriever — one that understands references, pronouns, and follow-ups — dramatically improving the accuracy of contextual search.
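You can also exercise the history-aware retriever on its own before wiring it into the full chain. A minimal sketch, reusing the demo history from the previous step:
# The history-aware retriever takes the raw input plus chat history
# and returns the document chunks relevant to the reformulated query.
retrieved_docs = history_aware_retriever.invoke({
    "input": "What does the 'Security First' one entail?",
    "chat_history": [
        HumanMessage(content="What are the core values of the company?"),
        AIMessage(content="The core values include 'Security First' and 'Customer Trust'."),
    ],
})
print(len(retrieved_docs))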
Step 3.3: Generating Answers with Context
Once the retriever can handle contextual queries, the final step is to integrate it into a question-answering chain. This chain combines retrieved documents with the user’s query to produce a grounded, context-rich response.
We start by defining a system prompt that tells the assistant how to use retrieved data.
qa_system_prompt = (
"You are an AI assistant for the following websites/pages: {urls_list_str}. "
"Use the retrieved context to answer the user's question. "
"If you cannot find the answer in the provided context, say so politely."
)This ensures the assistant’s answers remain factually grounded in the data retrieved from your own website, rather than relying on model assumptions.
Next, we create a document chain, which merges the system prompt, retrieved context, and chat history into a structured reasoning flow.
from langchain.chains.combine_documents import create_stuff_documents_chain
document_chain = create_stuff_documents_chain(
    llm,
    ChatPromptTemplate.from_messages([
        ("system", qa_system_prompt + "\n\n{context}"),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]).partial(urls_list_str=", ".join(WEBSITE_URLS))
)
Step-by-step explanation:
Combining Sources – The create_stuff_documents_chain function binds together the Gemini model, the structured system prompt, and retrieved content.
Prompt Customization – The placeholder {context} is automatically replaced with the text from the retrieved website data.
Dynamic URL Insertion – The partial() method fills the {urls_list_str} variable with your actual website URLs to make responses more transparent.
Integrated Context Handling – The model now has access to both prior conversation and relevant document snippets.
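If you want to see the document chain in isolation, you can invoke it with hand-picked documents as the context. This is an illustrative sketch only; in the full pipeline the retriever supplies the context automatically, and the sample document text here is made up for the example:
from langchain_core.documents import Document

# Invoke the document chain with a manually supplied context document.
answer = document_chain.invoke({
    "input": "What services are offered?",
    "chat_history": [],
    "context": [Document(page_content="ColabCodes offers website development services.")],
})
print(answer)  # a plain string generated by the LLM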
Finally, everything merges into a unified Conversational RAG chain that ties together retrieval, contextualization, and generation.
from langchain.chains import create_retrieval_chain
conversational_rag_chain = create_retrieval_chain(
    history_aware_retriever,
    document_chain
)
This chain ensures that:
The retriever provides the most relevant content based on user intent and chat history.
The document chain generates coherent, grounded responses using that context.
The conversation remains consistent, human-like, and domain-specific throughout the session.
Context awareness is what distinguishes a basic FAQ bot from an intelligent conversational assistant. With this setup, the assistant can:
Understand pronouns, references, and incomplete sentences in multi-turn conversations.
Maintain continuity across questions without losing track of previous exchanges.
Generate grounded, evidence-based answers using content directly from your website.
Politely decline to answer when no relevant information exists — ensuring credibility and factual accuracy.
In practice, this creates a human-like dialogue experience, where the assistant remembers what’s been said, interprets meaning from prior turns, and delivers consistent, data-backed answers.
4. Running the Conversational Assistant
After setting up the retrieval pipeline and context-awareness mechanisms, the final step is to bring the assistant to life — enabling it to interact dynamically with users while maintaining memory across multiple turns.
In this phase, you’ll see how to run a multi-turn conversation where the assistant remembers context, reformulates follow-up queries, and retrieves relevant website content seamlessly.
Setting Up the Chat Loop
We start by importing message classes from LangChain’s core messaging system. These classes help maintain structured conversation history that can be reused across turns.
from langchain_core.messages import HumanMessage, AIMessage
A simple list is used to track all messages exchanged between the human and the assistant during the session:
chat_history = []
This list is passed into every call to the conversational chain, so the model always has access to prior exchanges.
Next, we define the function that powers each conversational turn:
def ask_assistant(question: str):
    response = conversational_rag_chain.invoke({
        "input": question,
        "chat_history": chat_history,
    })
    answer = response["answer"]
    print(f"AI Assistant: {answer}")
    chat_history.extend([
        HumanMessage(content=question),
        AIMessage(content=answer)
    ])
How this function works:
The user’s question is passed to the Conversational RAG Chain, which automatically performs query contextualization, retrieval, and answer generation.
The chain returns a structured response object containing the assistant’s generated answer and other metadata.
The answer field is printed for display and then appended to the chat_history list, ensuring the assistant can recall it in subsequent turns.
Each turn enriches the context — allowing the assistant to form a memory of the conversation.
Example Conversation
To test the system, we can run two queries consecutively:
ask_assistant("I need assistance with my website")
ask_assistant("Is it regulated by the government?")
Output:
AI Assistant: Yes, ColabCodes can assist with your website. They offer "Website - Development" services, focusing on crafting fast, modern, and scalable websites tailored to elevate your brand online.
AI Assistant: I apologize, but the provided information does not mention whether ColabCodes is regulated by the government.
Here’s what happens internally:
Turn 1: The assistant retrieves information from your website and responds to the user’s request for help.
Turn 2: When the user asks, “Is it regulated by the government?”, the assistant understands this question refers back to the website mentioned earlier. It reformulates the query into a self-contained one (e.g., “Is the ColabCodes website regulated by the government?”), fetches relevant data, and produces a contextually aware response.
This ability to link questions across turns is what transforms a static retrieval system into a conversational AI assistant.
Behind the Scenes: What’s Happening
During each conversational turn, several integrated components work together:
Official Google GenAI SDK – Handles the low-level API communication with Gemini models. It ensures stable, optimized responses from Google’s latest generative models.
LangChain’s Conversational RAG Pipeline – Manages contextual prompts, history tracking, and chaining logic to link multiple steps like query reformulation, retrieval, and answer generation.
Chroma Vector Store – Serves as the local, high-performance storage for your text embeddings. It allows semantic search over your website content without relying on external databases.
Query Contextualization – Automatically rewrites follow-up questions based on conversation history, enabling smooth, human-like dialogue flow.
Together, these components form a robust and modular AI assistant architecture capable of scaling from small website chatbots to enterprise-grade knowledge systems.
Extending the System
Once the foundation is in place, you can extend this architecture for broader applications and richer features:
Multi-source Retrieval: Connect additional data types such as PDFs, internal documentation, CSV files, or APIs to expand your assistant’s knowledge base.
Persistent Memory: Use external storage (like Redis, SQLite, or PostgreSQL) to save chat history for long-term sessions or returning users; a minimal sketch follows this list.
Caching for Efficiency: Implement caching layers to avoid redundant embedding and retrieval steps, improving response speed.
Custom Interfaces: Integrate your assistant into web applications or dashboards using frameworks like Streamlit, Flask, or Next.js.
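As one possible starting point for the persistent-memory idea above, LangChain’s RunnableWithMessageHistory can wrap the existing chain so that history is tracked per session ID. The sketch below keeps histories in an in-memory dictionary and is an assumption about how you might structure it, not part of the original pipeline; swapping the dictionary for a Redis- or database-backed message history would make it persistent across restarts.
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

session_store = {}  # maps session IDs to their chat histories (in-memory only)

def get_session_history(session_id: str):
    # Create a history object for new sessions, reuse it for returning ones.
    if session_id not in session_store:
        session_store[session_id] = ChatMessageHistory()
    return session_store[session_id]

chat_with_memory = RunnableWithMessageHistory(
    conversational_rag_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)

response = chat_with_memory.invoke(
    {"input": "I need assistance with my website"},
    config={"configurable": {"session_id": "user-123"}},
)
print(response["answer"])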
This architecture can be customized to fit a wide variety of real-world applications, such as:
AI-Powered Website Support Bots: Offer instant, context-aware support based on live website content.
Internal Knowledge Assistants: Enable employees to query internal policies, documentation, or project data naturally.
Personalized Learning Assistants: Guide learners through technical or research topics with adaptive, context-aware responses.
Conversational Product Docs: Let users ask questions about software documentation and receive conversational explanations.
Each of these use cases leverages the same core principles — retrieval, grounding, and conversational context — to deliver accurate and user-friendly AI experiences.
Conclusion
At this point, we’ve built a fully functional Conversational RAG system powered by Google Gemini and LangChain. Your assistant can intelligently retrieve relevant data, remember previous exchanges, and deliver natural responses — all while staying grounded in your own content.
This setup provides a strong starting point for more advanced systems. By extending it with long-term memory, multi-source retrieval, and a user interface, you can scale this foundation into an enterprise-ready conversational AI assistant for real-world use.




