Functional Modes of Large Language Models (LLMs) – Explained with Gemini API Examples
- Oct 16, 2025
Large Language Models (LLMs) are no longer limited to generating text. They can reason, code, perceive, plan, and act. Model families like Gemini (by Google DeepMind) represent a new generation of multi-functional, multimodal AI systems, capable of operating in diverse modes ranging from text generation and code reasoning to function calling and autonomous agentic behavior.
This article explores the main functional modes of LLMs, focusing on how they work conceptually and how developers can use the Gemini API to implement each mode.
By the end of this blog, you’ll have learned:
The core operational modes of modern LLMs
The theoretical foundations behind each mode
Step-by-step Python code examples for Gemini API
How to integrate multiple modes (like embeddings + multimodal + tool calling)
Real-world use cases and best practices
This tutorial blends deep theoretical explanation with hands-on coding, ideal for developers who want to move from “chatbot-level” AI to system-level intelligent applications.

Theoretical Aspects: Understanding Functional Modes of LLMs
To truly harness the power of modern LLMs, it’s essential to understand how they operate internally. Functional modes define the behavior and capabilities of a model — from generating natural language to executing complex, multi-step tasks. In this section, we’ll explore the theory behind each mode, the architectural principles that enable them, and how these concepts translate into real-world AI applications.
1. What Are Functional Modes?
A functional mode defines how an LLM processes information and interacts with users or systems. It represents a behavioural layer built over the model’s core transformer architecture.
An LLM like Gemini can operate in several modes — for example,
reading and generating natural language (Text Mode),
writing code (Code Mode),
transforming text into numerical embeddings (Embeddings Mode),
understanding images and audio (Multimodal Mode),
performing structured tasks (Function Calling Mode), and
reasoning over long documents (Long Context Mode).
Most of these modes run on the same underlying model and differ mainly in prompting and input-output formatting; embeddings are the exception, since the API serves them through a dedicated embedding model.
2. Architecture Behind These Modes
To support multiple functional capabilities such as text generation, embeddings, coding assistance, and multimodal reasoning, modern large language models rely on a layered architecture built around transformer networks. Models such as Gemini are designed to handle different types of tasks through a combination of shared model components and specialized output layers. Let’s briefly explore how a model like Gemini supports multiple modes internally:
(a) Shared Transformer Backbone
At the core of the system lies a large transformer-based neural network that processes all input data. This backbone architecture enables the model to learn deep contextual relationships between tokens in a sequence.
In multimodal models like Gemini, different forms of input such as text, images, or audio are converted into numerical vectors within a unified embedding space. Each token, regardless of modality, becomes a vector representation that the transformer can process. This unified representation allows the model to perform cross-modal reasoning, linking visual, textual, and contextual information during inference.
(b) Adapters and Specialized Heads
While the transformer backbone performs the majority of the representation learning, different functional modes are supported through specialized components attached to the main model.
These components often take the form of task-specific adapters or output heads that interpret the shared representations produced by the transformer. For example:
Text head – optimized for natural language understanding and generation
Code head – trained on programming datasets such as GitHub repositories and technical documentation
Embedding head – produces dense vector representations used in search, retrieval systems, and semantic similarity tasks
Multimodal encoders – process visual or audio inputs before integrating them with textual information
This modular design allows a single large model to perform multiple tasks without requiring completely separate architectures.
(c) Function Calling and Structured Output Layer
Modern LLM APIs also support structured outputs through schema-based response formats. Instead of returning free-form text, the model can generate structured data such as JSON objects that follow a predefined format.
This capability enables language models to interact with external systems, trigger software functions, or integrate with application workflows. Through structured outputs and tool integration, LLMs can move beyond passive text generation and participate in action-oriented tasks such as database queries, API calls, and automated decision pipelines.
3. Gemini vs Traditional LLMs
Feature | Traditional LLM (GPT-3, LLaMA) | Gemini (Multimodal LLM) |
Input Types | Text only | Text, Image, Audio, Video |
Modes Supported | Text, Code | Text, Code, Embeddings, Multimodal, Function Calling, Long Context |
Context Length | Roughly 2K to 32K tokens | Up to 1 million tokens |
Tool Use | Limited | Natively supports structured function calls |
Architecture | Text-based transformer | Multimodal transformer + cross-attention layers |
4. Developer API Structure
Modern LLM platforms expose their capabilities through developer-friendly APIs that allow applications to interact with models programmatically. The Gemini API provides a simple interface for sending prompts and receiving generated responses.
Using the Python SDK, developers can initialize a model and generate content with only a few lines of code.
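import os
import google.generativeai as genai  # pip install google-generativeai; this package provides GenerativeModel
# Read the API key from an environment variable (the name GOOGLE_API_KEY is an assumption)
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content("Hello Gemini!")
print(response.text)
In this example, the application sends a text prompt to the model and receives a generated response. The API handles the underlying processes such as tokenization, inference, and response formatting.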
Different model variants within the Gemini ecosystem correspond to specific functional capabilities. Selecting the appropriate model allows developers to target different tasks and performance levels.
Some commonly used models include:
gemini-1.5-pro → optimized for advanced text generation, reasoning, code generation, and multimodal tasks
embedding-001 → designed for generating vector embeddings used in semantic search, recommendation systems, and retrieval pipelines
Through this modular API structure, developers can integrate large language models into applications such as chatbots, document analysis systems, search engines, and automated assistants.
Text and Code Modes – Reasoning, Communication, and Programming
Text processing forms the foundation of modern large language models. In this mode, the model interprets natural language input and generates meaningful responses by analyzing context, semantics, and relationships between words. For models such as Gemini, this capability powers tasks such as question answering, summarization, translation, reasoning, and creative writing.
Because most human–computer interaction occurs through language, text mode remains the primary interface through which users communicate with AI systems. These models can also maintain multi-turn conversations, allowing them to interpret follow-up questions, retain contextual information, and generate coherent responses even when prompts become complex or layered.
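To illustrate how multi-turn context can be carried on the application side, here is a minimal sketch of a bounded conversation history. The `Conversation` class and its `max_messages` limit are hypothetical helpers, and the `role`/`parts` layout simply mirrors the message structure used in this article's multimodal examples; it is not a description of how the SDK stores chat state.

```python
class Conversation:
    """Keeps a bounded, ordered history of user and model messages."""

    def __init__(self, max_messages=6):
        self.max_messages = max_messages
        self.history = []

    def add(self, role, text):
        if role not in ("user", "model"):
            raise ValueError("role must be 'user' or 'model'")
        self.history.append({"role": role, "parts": [{"text": text}]})
        # Drop the oldest messages once the history exceeds the limit
        self.history = self.history[-self.max_messages:]


chat = Conversation(max_messages=4)
chat.add("user", "What is overfitting?")
chat.add("model", "Overfitting is when a model memorizes training data.")
chat.add("user", "How do I prevent it?")  # "it" only makes sense with the history
print(len(chat.history), chat.history[-1]["parts"][0]["text"])
```

Sending the whole `history` list with each request is what lets a follow-up such as "How do I prevent it?" be resolved against the earlier turns.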
At the same time, modern LLMs extend this capability into code generation and programming assistance. Trained on large-scale code repositories and technical documentation, these models can understand programming syntax, explain algorithms, generate functions, and assist developers with debugging or documentation. Instead of learning a new interface, developers simply describe the task in natural language.
The following example demonstrates how developers can interact with the model using text prompts through the Python API.
import google.generativeai as genai  # package that provides GenerativeModel
# Assumes the API key was set earlier with genai.configure(api_key=...)
model = genai.GenerativeModel("gemini-1.5-pro")
prompt = """
Explain reinforcement learning in simple terms.
Provide a real-world analogy and an example in Python.
"""
response = model.generate_content(prompt)
print(response.text)
In this example, the model interprets multiple instructions within a single prompt. It understands the conceptual topic, generates a simplified explanation, constructs a real-world analogy, and produces a Python example. Internally, the transformer architecture uses attention mechanisms to determine which parts of the input prompt are most relevant when generating the response.
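To make the idea of attention a little more concrete, the following is a minimal sketch of scaled dot-product attention weights in plain Python. The toy vectors and the function name `attention_weights` are illustrative assumptions; this shows the general mechanism, not Gemini's internal implementation.

```python
import math


def attention_weights(query, keys):
    """Softmax over scaled dot products between one query and each key."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    # Subtract the maximum score for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-dimensional vectors, made up for illustration
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.0]
weights = attention_weights(query, keys)
print([round(w, 3) for w in weights])
```

The key most aligned with the query receives the largest weight, which is the sense in which the model "focuses" on the most relevant parts of a prompt.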
Embeddings Mode – Semantic Search and Knowledge Integration
Embeddings mode converts text or multimodal input into high-dimensional numeric vectors that represent semantic meaning. This enables similarity search, RAG (Retrieval-Augmented Generation), and clustering.
import google.generativeai as genai
# Embeddings come from the module-level embed_content() call, not from GenerativeModel
text = "Machine learning models that learn from experience."
result = genai.embed_content(model="models/embedding-001", content=text)
print(result["embedding"][:10])  # Display first 10 dimensions
Each text input is mapped to a point in an n-dimensional semantic space. Two similar sentences will produce vectors that are close together, enabling context retrieval in chatbots or AI assistants.
Use case example: RAG with Gemini
Convert all your documents into embeddings
Store them in a vector database like Pinecone or FAISS
Retrieve the top relevant chunks using cosine similarity
Feed them back into Gemini for context-aware response generation
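The retrieval step in the list above can be sketched without any external service. The example below is a minimal illustration that ranks documents by cosine similarity; the 3-dimensional vectors are hypothetical stand-ins for real embeddings, and `cosine_similarity` and `top_k` are helper names chosen for this sketch.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional embeddings; real ones have hundreds of dimensions
doc_vecs = [
    [0.9, 0.1, 0.0],  # doc 0
    [0.0, 1.0, 0.0],  # doc 1
    [0.8, 0.2, 0.1],  # doc 2
]
query_vec = [1.0, 0.0, 0.0]
results = top_k(query_vec, doc_vecs, k=2)
print(results)
```

A vector database such as Pinecone or FAISS performs the same ranking, but with indexes that avoid comparing the query against every stored vector.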
Multimodal Mode – Unified Understanding of Text, Image, and Audio
In multimodal mode, visual or audio data is first converted into vector representations. These representations are then processed alongside textual tokens within the transformer architecture. This enables the model to interpret relationships between visual elements and textual instructions, making it possible to perform tasks such as image analysis, diagram explanation, or visual question answering.
The following Python example demonstrates how an image can be provided as input alongside a textual prompt.
import google.generativeai as genai
model = genai.GenerativeModel("gemini-1.5-pro")
# Read the image as raw bytes
with open("chart.png", "rb") as f:
    image_bytes = f.read()
response = model.generate_content([
    {"role": "user", "parts": [
        {"text": "Explain what this chart represents:"},
        # Raw bytes are passed as inline_data; file_data refers to an uploaded file URI
        {"inline_data": {"mime_type": "image/png", "data": image_bytes}}
    ]}
])
print(response.text)
In this example, the model receives both a question and an image. Internally, attention mechanisms connect visual features with textual context. This allows the model to generate explanations such as interpreting axes in a chart, identifying trends in data visualizations, or describing objects present in an image.
Typical applications of multimodal LLM capabilities include:
Visual question answering
Automatic document analysis for charts, receipts, and PDFs
Educational visual tutoring systems
Content moderation and image caption generation
By integrating text, images, and other data types within a single architecture, multimodal models significantly expand the range of tasks that artificial intelligence systems can perform. This unified approach enables more natural and powerful interactions between humans and intelligent systems.
Function Calling Mode – Connecting Language Models with Real-World Actions
The Function Calling mode is one of the most powerful evolutions in modern LLMs. It allows models like Gemini to go beyond text generation and produce structured, machine-readable outputs — typically in JSON format — that can be used to trigger API calls, database queries, or system actions.
Instead of only returning text like “Sure, I’ll send an email,” the model outputs something like:
{
  "function": "send_email",
  "arguments": {
    "to": "samultechie@gmail.com",
    "subject": "Meeting Rescheduled",
    "body": "The meeting is rescheduled to 4 PM."
  }
}
This structured output capability bridges natural language understanding with programmatic control, enabling AI-driven automation systems and intelligent backend integrations.
At the architecture level, function calling is achieved through structured prompting and fine-tuning with schema-conditioned outputs.
When the model detects that an instruction can be resolved via an API or a system tool, it formats its output according to a JSON schema or tool definition you provide.
For example:
In a booking bot, functions might include book_flight, check_weather, cancel_reservation.
In a developer assistant, functions might include run_code, search_docs, or analyze_error.
This mode turns an LLM into an intelligent controller for external systems.
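A tool definition is usually a name, a description, and a JSON-Schema-style description of the parameters. The sketch below shows one such definition and a small checker that compares a model-produced call against it. `SEND_EMAIL_TOOL` and `validate_call` are hypothetical names for illustration, and the checker covers only required fields and basic types, not full JSON Schema.

```python
SEND_EMAIL_TOOL = {
    "name": "send_email",
    "description": "Send an email to a single recipient.",
    "parameters": {
        "type": "object",
        "properties": {
            "to": {"type": "string"},
            "subject": {"type": "string"},
            "body": {"type": "string"},
        },
        "required": ["to", "subject", "body"],
    },
}

# Maps schema type names to Python types for a basic isinstance check
_TYPES = {"string": str, "number": (int, float), "boolean": bool, "object": dict}


def validate_call(call, tool):
    """Return a list of problems; an empty list means the call matches the tool."""
    problems = []
    if call.get("function") != tool["name"]:
        problems.append(f"unexpected function: {call.get('function')!r}")
        return problems
    args = call.get("arguments", {})
    spec = tool["parameters"]
    for name in spec["required"]:
        if name not in args:
            problems.append(f"missing argument: {name}")
    for name, value in args.items():
        prop = spec["properties"].get(name)
        if prop is None:
            problems.append(f"unknown argument: {name}")
        elif not isinstance(value, _TYPES[prop["type"]]):
            problems.append(f"wrong type for {name}")
    return problems


good = {"function": "send_email",
        "arguments": {"to": "a@example.com", "subject": "Hi", "body": "Hello"}}
bad = {"function": "send_email", "arguments": {"to": "a@example.com", "cc": 5}}
print(validate_call(good, SEND_EMAIL_TOOL))  # []
print(validate_call(bad, SEND_EMAIL_TOOL))
```

Checking calls against the declared schema before executing them keeps a malformed or unexpected model output from reaching a real system.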
Gemini API Example: Function Calling
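The simplest way to get started is structured prompting: you instruct Gemini to output JSON objects that your Python app can parse and route to an external function. The SDK also supports declaring tools natively; the example here uses plain prompting to keep the mechanics visible.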
import json
import google.generativeai as genai  # package that provides GenerativeModel
model = genai.GenerativeModel("gemini-1.5-pro")
prompt = """
You are a task automation assistant.
Whenever I ask to perform an action, respond with a JSON object
containing 'function' and 'arguments'.
User: Send an email to Samul saying the meeting is at 4 PM.
"""
response = model.generate_content(prompt)
try:
    data = json.loads(response.text)
    print(f"Function to execute: {data['function']}")
    print(f"Arguments: {data['arguments']}")
except (json.JSONDecodeError, KeyError, TypeError):
    # The model replied with free text or with JSON in an unexpected shape
    print("Raw output:", response.text)
Explanation
The model interprets the natural language command.
It follows your instruction to return structured data.
Your backend parses this JSON and routes it to an execution handler (e.g., email sender or calendar updater).
This forms the backbone of agentic LLM systems, where models act as the “brains” and external functions as the “hands.”
Real-World Use Cases
Task automation bots: Automate calendar scheduling, emails, or Slack updates.
Conversational RPA (Robotic Process Automation): Gemini acts as the reasoning layer on top of RPA workflows.
Customer service orchestration: Dynamically call CRM APIs or ticketing systems.
Data pipelines: Execute SQL queries or API requests based on natural queries.
Example: Email Automation
def send_email(to, subject, body):
    print(f"Sending Email to: {to}\nSubject: {subject}\nBody: {body}")
response_json = {
    "function": "send_email",
    "arguments": {
        "to": "samuel@colabcodes.com",
        "subject": "Project Update",
        "body": "The deployment is complete and live on the server."
    }
}
# Execute dynamically
if response_json["function"] == "send_email":
    args = response_json["arguments"]
    send_email(args["to"], args["subject"], args["body"])
In a production workflow, you would parse Gemini’s JSON output and route it automatically using a dispatcher or orchestrator class.
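One possible shape for such a dispatcher is a registry that maps function names to handlers. This is a minimal sketch under stated assumptions: the `tool` decorator, the `HANDLERS` registry, and the `check_weather` handler are hypothetical, and the handlers only return strings instead of performing real actions.

```python
import json

HANDLERS = {}


def tool(name):
    """Decorator that registers a handler under a function name."""
    def register(func):
        HANDLERS[name] = func
        return func
    return register


@tool("send_email")
def send_email(to, subject, body):
    return f"email to {to}: {subject}"


@tool("check_weather")
def check_weather(city):
    return f"weather lookup for {city}"


def dispatch(raw_json):
    """Parse a model response and call the matching registered handler."""
    call = json.loads(raw_json)
    name = call["function"]
    if name not in HANDLERS:
        raise ValueError(f"no handler registered for {name!r}")
    return HANDLERS[name](**call["arguments"])


model_output = '{"function": "check_weather", "arguments": {"city": "Pune"}}'
result = dispatch(model_output)
print(result)
```

Because only registered names can be called, the registry doubles as an allow-list: the model cannot trigger a function the application has not explicitly exposed.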
Best Practices
Always define a clear schema and instruct Gemini to follow it strictly.
Validate the output with json.loads() before executing.
Use fallback prompts if Gemini returns free text instead of JSON.
Combine this mode with Embeddings Mode for context-aware actions.
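The validation and fallback practices above can be sketched as follows. The helper names `extract_json` and `ask_for_json` are hypothetical, and `fake_generate` is a stand-in for a real model call so the example runs offline. The sketch also assumes the reply may wrap its JSON in a Markdown code fence, which is a common pattern in model replies.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def extract_json(text):
    """Parse JSON from a reply, tolerating a surrounding Markdown code fence."""
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    candidate = match.group(1) if match else text
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError:
        return None


def ask_for_json(generate, prompt, max_attempts=2):
    """Call `generate` until it returns parseable JSON or attempts run out."""
    current = prompt
    for _ in range(max_attempts):
        data = extract_json(generate(current))
        if data is not None:
            return data
        # Fallback prompt: restate the format requirement
        current = prompt + "\nRespond with a single valid JSON object and nothing else."
    return None


# Stand-in for a model call: free text first, fenced JSON on the retry
replies = iter([
    "Sure, I'll send that email!",
    FENCE + 'json\n{"function": "send_email", "arguments": {}}\n' + FENCE,
])


def fake_generate(_prompt):
    return next(replies)


result = ask_for_json(fake_generate, "Send an email to Samul.")
print(result)
```

In a real application, `generate` would wrap the model call and return its text, and a `None` result would be logged or surfaced to the user rather than executed.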
Long Context Mode – Understanding and Reasoning Over Massive Inputs
Traditional LLMs could handle a few thousand tokens, which limited their usefulness for document-heavy tasks. Gemini 1.5 revolutionises this with up to 1 million tokens of context, allowing the model to process entire codebases, research papers, or legal documents in a single query.
This is called Long Context Mode. Long Context enables new categories of AI systems:
Research summarization (analyzing multiple academic papers)
Enterprise intelligence (reasoning over reports, contracts, or logs)
Codebase understanding (full repository analysis)
Memory-enhanced chatbots (long-term context retention)
It essentially turns an LLM into a persistent cognitive engine capable of in-depth reasoning across extensive information.
Google has not publicly documented the exact mechanisms behind Gemini's long context. Techniques commonly used to make long-context processing efficient include:
Sparse Attention Mechanisms – the model doesn’t attend to every token but learns to focus selectively.
Segmented Memory Encoding – divides long text into sections with hierarchical summaries.
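The idea of dividing long text into sections can also be applied on the application side, for example when a document exceeds the context window or when you want to summarize it piece by piece. The sketch below is a minimal illustration; `split_into_segments` is a hypothetical helper, and it counts words as a rough proxy for tokens, which real tokenizers will not match exactly.

```python
def split_into_segments(text, max_words=200, overlap=20):
    """Split text into word-based segments that share `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    segments = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        segments.append(" ".join(words[start:start + max_words]))
        # Stop once this segment has reached the end of the text
        if start + max_words >= len(words):
            break
    return segments


document = " ".join(f"word{i}" for i in range(500))
segments = split_into_segments(document, max_words=200, overlap=20)
print(len(segments), [len(s.split()) for s in segments])
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring segments, at the cost of processing some words twice.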
Integrating LLM Modes & Real-World AI Projects with Gemini API
By now, you understand the individual functional modes of LLMs like Gemini: Text, Code, Embeddings, Multimodal, Function Calling, Long Context, and Agentic. However, real-world AI systems rarely use just one mode in isolation. The true power of LLMs emerges when multiple modes are orchestrated together.
This section explores:
Mode integration for complex workflows
RAG (Retrieval-Augmented Generation) with Gemini
Multimodal + Function Calling pipelines
Gemini vs GPT: mode capabilities and trade-offs
Mini-project examples to illustrate practical applications
Best practices for production-ready AI systems
Integrating Multiple Functional Modes
Integrating multiple functional modes allows Large Language Models (LLMs) to operate seamlessly across reasoning, generation, coding, and tool-using capabilities within a single workflow. This fusion enables dynamic task handling — for example, a model can analyze data, generate insights, and execute function calls in one coherent interaction.
Using LLMs in multi-step workflows unlocks advanced applications, such as:
AI research assistants that retrieve papers, summarize, and plan follow-ups
Customer support systems that combine text reasoning, knowledge base search, and API-driven actions
Autonomous agents that analyze multimodal data and take structured actions
Each mode acts as a modular component that can be orchestrated in a pipeline.
Example Architecture: RAG + Function Calling + Multimodal
Input: User asks a question, optionally including an image or document
Embeddings Mode: Convert text/document to vectors for semantic search
RAG: Retrieve top relevant documents or knowledge snippets
Long Context Mode: Summarize or reason over retrieved content
Function Calling Mode: Generate structured outputs for action (like sending an email or storing data)
Multimodal Mode: If the input includes an image, process it alongside text
This architecture forms a flexible, multi-modal AI system capable of reasoning, acting, and interacting intelligently.
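One way to picture this orchestration is as a list of stage functions that pass a shared state dictionary along. The sketch below is an assumption-laden illustration: every stage is a hypothetical stub (word overlap stands in for embeddings, and `notify_team` is an invented function name), so it shows only the control flow, not real model calls.

```python
def embed_stage(state):
    # Stand-in for an embeddings call: a word set instead of a real vector
    state["query_terms"] = set(state["query"].lower().split())
    return state


def retrieve_stage(state):
    # Rank documents by word overlap with the query (toy retrieval)
    def overlap(doc):
        return len(state["query_terms"] & set(doc.lower().split()))
    state["context"] = sorted(state["documents"], key=overlap, reverse=True)[:1]
    return state


def action_stage(state):
    # Stand-in for function calling: emit a structured action
    state["action"] = {
        "function": "notify_team",
        "arguments": {"summary": state["context"][0]},
    }
    return state


def run_pipeline(state, stages):
    """Run each stage in order, passing the shared state along."""
    for stage in stages:
        state = stage(state)
    return state


state = run_pipeline(
    {
        "query": "sales trends",
        "documents": ["Marketing engagement improved.", "Q1 sales trends were positive."],
    },
    [embed_stage, retrieve_stage, action_stage],
)
print(state["action"])
```

Keeping each mode behind its own function makes it straightforward to swap a stub for a real API call, or to reorder and skip stages depending on the input.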
Python Example: Multi-Mode Pipeline
import json
import google.generativeai as genai  # package that provides GenerativeModel
model = genai.GenerativeModel("gemini-1.5-pro")
# Step 1: Convert user query to embedding
query = "Summarize the sales trends from this chart and notify the team."
query_embedding = genai.embed_content(model="models/embedding-001", content=query)["embedding"]
# Step 2: Retrieve relevant documents (mock retrieval; a real system would
# search a vector store using query_embedding)
documents = [
    "Q1 Sales increased by 20%. Q2 Sales show slight decline...",
    "The marketing campaign improved engagement by 15%..."
]
# Step 3: Summarize retrieved documents
context = "\n".join(documents)
summary_prompt = f"Summarize the following data and generate a JSON object for notifications:\n{context}"
summary_response = model.generate_content(summary_prompt)
# Step 4: Parse JSON and trigger function (Function Calling Mode)
try:
    action_data = json.loads(summary_response.text)
    print(action_data)
except json.JSONDecodeError:
    print("Output is not JSON, raw response:", summary_response.text)
This dummy pipeline is an example of how multiple Gemini modes can be chained in a real-world workflow.
Retrieval-Augmented Generation (RAG) with Gemini
RAG combines embedding-based retrieval with LLM reasoning. Gemini supports RAG pipelines by integrating:
Embeddings for vector search
Text summarization / generation
Function calls for structured outputs
Step-by-Step RAG Workflow
Document Embedding: Convert knowledge base content to vectors.
Semantic Retrieval: Use similarity metric to fetch the most relevant documents.
Contextual Generation: Feed the retrieved documents as additional context to Gemini.
Optional Function Calling: Output results in structured format for automation.
RAG Example with Gemini API
import google.generativeai as genai  # package that provides GenerativeModel
model = genai.GenerativeModel("gemini-1.5-pro")
# Mock embeddings and retrieval
docs = [
    "Gemini can process text, images, and code.",
    "It supports function calling and agentic reasoning for complex tasks."
]
user_query = "Explain Gemini's functional modes in a structured summary."
# Step 1: Convert query to embedding
query_vector = genai.embed_content(model="models/embedding-001", content=user_query)["embedding"]
# Step 2: Retrieve relevant document (mock retrieval)
retrieved_doc = docs[0]  # Normally, similarity search selects the best match
# Step 3: Generate response with context
rag_prompt = f"""
You are an AI assistant. Use the following context to answer the user query.
Context: {retrieved_doc}
User Query: {user_query}
Provide a structured JSON summary.
"""
response = model.generate_content(rag_prompt)
print(response.text)
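When more than one document is retrieved, the prompt has to be assembled from several chunks without growing unbounded. The following is a minimal sketch of that step; `build_rag_prompt` is a hypothetical helper, and the character budget is an arbitrary stand-in for a real token limit.

```python
def build_rag_prompt(question, ranked_chunks, max_chars=500):
    """Pack the highest-ranked chunks into the prompt until the budget is used."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        if used + len(chunk) > max_chars:
            break
        selected.append(chunk)
        used += len(chunk)
    # Number the chunks so the answer can refer back to its sources
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(selected))
    return (
        "Use only the context below to answer.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


chunks = [
    "Gemini can process text, images, and code.",
    "It supports function calling and agentic reasoning for complex tasks.",
    "x" * 1000,  # oversized chunk that does not fit the budget
]
prompt = build_rag_prompt("What inputs can Gemini process?", chunks)
print(prompt)
```

The chunks are assumed to arrive already sorted by relevance, so the budget cuts off the least relevant material first.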
Multimodal + Function Calling Pipeline
Gemini allows combining images, text, and function calls in one pipeline.
Example: Chart Analysis and Notification
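import google.generativeai as genai
model = genai.GenerativeModel("gemini-1.5-pro")
# Read the chart image as raw bytes
with open("sales_chart.png", "rb") as f:
    chart_bytes = f.read()
prompt = [
    {"role": "user", "parts": [
        {"text": "Analyze this sales chart and suggest actions for the marketing team."},
        # Raw bytes are passed as inline_data; file_data refers to an uploaded file URI
        {"inline_data": {"mime_type": "image/png", "data": chart_bytes}}
    ]}
]
response = model.generate_content(prompt)
print(response.text)
If you also instruct the model to reply in JSON, it can return an object specifying actionable insights, which your backend can parse and execute.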
Conclusion
Large Language Models like Gemini have evolved far beyond simple text generation. Today, they are multi-functional cognitive engines capable of understanding, reasoning, coding, perceiving, planning, and acting. By leveraging different functional modes — Text, Code, Embeddings, Multimodal, Function Calling, Long Context, and Agentic — developers can build AI systems that are not only responsive but also autonomous, context-aware, and action-driven.
The key takeaway is that each mode is a tool, and the true power of modern LLMs lies in integrating these modes to create sophisticated pipelines:
Text Mode powers reasoning and conversation.
Code Mode enables AI-assisted programming.
Embeddings Mode drives semantic search and RAG workflows.
Multimodal Mode allows LLMs to perceive and analyze images, text, and audio together.
Function Calling Mode bridges the model with real-world actions.
Long Context Mode supports in-depth reasoning over massive documents.
Agentic Mode orchestrates multi-step planning and execution.
By combining these capabilities, developers can create research assistants, workflow automation systems, intelligent chatbots, and autonomous AI agents that operate at scale and complexity previously unimaginable.
Gemini API offers a unified, developer-friendly ecosystem for harnessing all these modes — from simple text generation to complex, multi-modal, agentic pipelines. Understanding and leveraging these functional modes allows you to unlock the full potential of modern AI, bridging the gap between human-like reasoning and practical, real-world action.





