top of page
Gradient With Circle
Image by Nick Morrison

Insights Across Technology, Software, and AI

Discover articles across technology, software, and AI. From core concepts to modern tech and practical implementations.

Markov Decision Process (MDP): Theory, Mathematics, and Python Implementation

  • 4 hours ago
  • 8 min read

Many real-world AI systems must make a sequence of decisions rather than a single prediction. From autonomous vehicles and robotics to recommendation engines and game-playing agents, intelligent systems often operate in environments where actions influence future outcomes. Markov Decision Processes (MDPs) provide a mathematical framework for modeling such decision-making problems and serve as the foundation of reinforcement learning. In this article, we will explore the theory behind MDPs, examine their mathematical formulation, and implement a simple MDP in Python to understand how intelligent agents learn optimal behavior.


Markov Decision Processes (MDPs)

What is a Markov Decision Process (MDP)?

A Markov Decision Process (MDP) is a mathematical framework used to model decision-making problems in environments where outcomes are influenced by both an agent's actions and elements of uncertainty. It provides a structured way to represent situations in which an agent interacts with its environment over time, making a sequence of decisions with the goal of maximizing cumulative rewards.


MDPs are among the most important concepts in reinforcement learning because they provide a formal mathematical model for how an intelligent agent observes its environment, takes actions, transitions between states, and receives feedback. By describing these interactions mathematically, MDPs allow researchers and practitioners to analyze, design, and optimize decision-making algorithms for complex environments.


Consider a robot navigating through a warehouse. At any moment, the robot occupies a specific location, which represents its current state. It can choose actions such as moving forward, turning left, or turning right. Depending on the action taken, the robot transitions to a new state and may receive a reward based on its progress toward a destination. Similar decision-making scenarios arise in self-driving vehicles, recommendation systems, robotics, finance, and game-playing agents.


Few of the core components of an MDP are listed below:


  1. States (S): States represent the various situations in which an agent can exist within an environment.

  2. Actions (A): Actions are the choices available to an agent at any given state.

  3. Transition Probabilities (P): Transition probabilities define the likelihood of moving from one state to another after taking an action.

  4. Reward Function (R): Rewards provide feedback to the agent and indicate the desirability of particular actions or outcomes.

  5. Discount Factor (γ): The discount factor determines the importance of future rewards relative to immediate rewards.


Mathematical Formulation of MDPs

While the components of an MDP provide a conceptual framework for modeling decision-making problems, their true power comes from the mathematical relationships that govern state transitions, rewards, and long-term outcomes. These mathematical foundations enable agents to reason about future consequences and identify actions that maximize cumulative rewards.


At the core of an MDP is the state transition function, which describes how the environment changes when an agent performs an action. The probability of transitioning from a current state (s) to a future state (s') after taking action (a) is represented as:


state transition function

This transition probability captures the dynamics of the environment and allows us to model both deterministic and stochastic systems.

Another important component is the reward function, which specifies the immediate reward received after taking an action:


reward function

In some formulations, the reward may also depend on the resulting state:


reward function with future state

The objective of an agent is not simply to maximize immediate rewards but to maximize the total rewards accumulated over time. This concept is represented by the return, which is the sum of future discounted rewards:


Cumulative Discounted Reward

1. Gt is the return at time step (t)

2. Rt+1 is the reward received in the next step

3. γ is the discount factor


This formulation allows agents to evaluate the long-term consequences of their decisions rather than focusing solely on immediate outcomes.


Policies in Markov Decision Processes

A policy defines the behavior of an agent by specifying which action should be taken in a given state. In reinforcement learning, the policy effectively represents the agent's strategy for interacting with the environment.

A policy is commonly denoted by:


policy for MDPs

which represents the probability of selecting action (a) when the agent is in state (s).

The goal of many reinforcement learning algorithms is to discover an optimal policy that maximizes cumulative rewards over time.


1. Deterministic Policies

In a deterministic policy, every state maps to exactly one action. Whenever the agent encounters a particular state, it always chooses the same action.


For example, in a navigation problem:


  • If the agent is in State A → Move Right

  • If the agent is in State B → Move Up


Deterministic policies are simple and predictable but may not perform well in highly uncertain environments.


2. Stochastic Policies

In a stochastic policy, actions are selected according to a probability distribution.

For example:

Action

Probability

Move Left

20%

Move Right

60%

Move Up

20%

Stochastic policies are often useful when exploration is important or when the environment contains uncertainty.


Value Functions for Markov Decision Process (MDPs)

While policies describe what actions an agent should take, value functions measure how beneficial states and actions are in terms of expected future rewards.

Value functions help an agent determine which decisions are likely to produce the greatest long-term return.


1. State Value Function

The state value function estimates the expected cumulative reward obtained by starting from a particular state and following a given policy.

It is represented as:


State Value Function

A higher state value indicates that the state is more desirable because it is expected to lead to greater future rewards.

While the return Gt is the actual accumulated reward from a single sequence of actions, the environment and policy can be stochastic (random). Therefore, an agent cannot know the exact future return from just a single trajectory.


2. Action Value Function

The action value function, often called the Q-function, estimates the expected return of taking a specific action in a given state and then following a policy thereafter.

It is represented as:


Action-Value Function, or Q-function

The Q-function evaluates the "quality" of an action. It calculates the expected total discounted reward an agent will receive if it starts in state s, executes action a immediately (even if that action contradicts its current policy), and then strictly follows policy pi for all subsequent steps.

Action value functions form the foundation of many reinforcement learning algorithms, including Q-Learning and Deep Q Networks (DQNs).

Together, state value functions and action value functions enable agents to evaluate alternatives and improve their decision-making strategies.


The Bellman Equation

The Bellman Equation is one of the most important mathematical concepts in reinforcement learning and dynamic programming. Proposed by mathematician Richard Bellman, it provides a recursive relationship for calculating the value of a state.

The key insight behind the Bellman Equation is that the value of a state can be decomposed into two parts:


  1. The immediate reward received.

  2. The value of future states.


This recursive relationship allows complex decision-making problems to be broken down into smaller subproblems.


The Bellman Expectation Equation for a state value function is:


Bellman Expectation Equation

This equation states that the value of a state equals the expected immediate reward plus the discounted value of future states.


The Bellman Optimality Equation shifts the focus from evaluating a policy to finding the best possible policy. The optimal state-value function, denoted as V*(s), represents the maximum expected return achievable from state s by any policy.


Bellman Optimality Equation

These equations form the mathematical foundation of many reinforcement learning algorithms and provide the mechanism through which agents learn optimal behavior.


Solving Markov Decision Processes

The ultimate objective of solving an MDP is to find an optimal policy that maximizes cumulative rewards. Several algorithms have been developed to achieve this objective.


1. Policy Evaluation

Policy Evaluation computes the value function associated with a given policy.

Given a fixed policy, the algorithm repeatedly applies the Bellman Expectation Equation until the state values converge.

The result is an estimate of how good each state is under the current policy.


2. Policy Improvement

Once state values have been estimated, the policy can be improved by selecting actions that lead to higher-value states.

This process updates the agent's behavior and produces a better policy.


3. Policy Iteration

Policy Iteration combines Policy Evaluation and Policy Improvement into a two-step iterative process:


  1. Evaluate the current policy.

  2. Improve the policy using updated value estimates.


These steps are repeated until the policy no longer changes, indicating that an optimal policy has been found.

Policy Iteration is guaranteed to converge to an optimal solution for finite MDPs.


4. Value Iteration

Value Iteration is another popular method for solving MDPs.

Instead of separately evaluating and improving policies, Value Iteration directly updates state values using the Bellman Optimality Equation:


Bellman Optimality Equation in value iteration

As the values converge, the optimal policy can be extracted by selecting actions that maximize expected future rewards.


Value Iteration is often simpler to implement than Policy Iteration and is widely used in reinforcement learning and planning problems.


Together, these solution methods provide powerful tools for finding optimal strategies in environments modeled by Markov Decision Processes.


Extensions of Markov Decision Process (MDPs)

While Markov Decision Processes provide a powerful framework for modeling sequential decision-making problems, many real-world environments are far more complex than the assumptions made by standard MDPs. In practice, agents may have incomplete information about their surroundings, operate in continuous environments, or interact with other intelligent agents. To address these challenges, researchers have developed several extensions of the traditional MDP framework.

These extensions enable more realistic and scalable modeling of complex decision-making scenarios and form the foundation of many advanced reinforcement learning systems.


1. Partially Observable Markov Decision Processes (POMDPs)

A standard MDP assumes that the agent has complete knowledge of the current state of the environment. However, in many real-world situations, the true state may not be directly observable. For example:


  • A self-driving car may have limited visibility due to fog or heavy rain.

  • A robot's sensors may provide noisy or incomplete measurements.

  • A medical diagnosis system may not have access to all relevant patient information.


Partially Observable Markov Decision Processes (POMDPs) extend MDPs by allowing agents to make decisions based on observations rather than directly observed states. Since the true state is unknown, the agent maintains a belief state, which is a probability distribution over possible states of the environment.

POMDPs are widely used in robotics, autonomous systems, healthcare, and navigation problems where uncertainty and incomplete information are common.


2. Continuous State and Action MDPs

Traditional MDPs often assume that states and actions belong to finite, discrete sets. While this assumption simplifies computation, many real-world systems operate in continuous spaces. Examples include:


  • The position and velocity of a robot.

  • The steering angle and acceleration of a vehicle.

  • Financial portfolio allocation decisions.

  • Control systems in industrial automation.


In Continuous State and Action MDPs, states and actions can take any value within a continuous range rather than being restricted to predefined categories. These environments are significantly more challenging because the number of possible states and actions can be effectively infinite.

Modern reinforcement learning algorithms such as policy gradient methods, actor-critic models, and deep reinforcement learning techniques are often designed to handle continuous decision spaces.


3. Hierarchical MDPs

As decision-making problems become more complex, solving them directly using a single MDP can become computationally expensive. Hierarchical MDPs address this challenge by decomposing large tasks into smaller, more manageable sub-tasks.

For example, a household robot may have a high-level goal of preparing a meal. This objective can be broken down into smaller tasks such as:


  • Gathering ingredients.

  • Preparing cooking utensils.

  • Cooking the food.

  • Serving the meal.


Each sub-task can be represented as its own decision-making problem, allowing the agent to solve complex objectives more efficiently.

Hierarchical approaches improve scalability, encourage reusable behaviors, and often accelerate learning in large environments.


4. Multi-Agent Decision Processes

Many environments contain multiple agents that interact with one another rather than a single decision-maker. Examples include:


  • Autonomous vehicles sharing a road network.

  • Multiple robots collaborating in a warehouse.

  • Financial trading systems competing in a market.

  • Players interacting in multiplayer games.


Multi-Agent Decision Processes extend traditional MDPs by incorporating multiple agents whose actions influence the environment and each other's outcomes.

These systems introduce additional challenges because each agent must consider not only the environment but also the behavior of other agents. Depending on the problem, agents may cooperate, compete, or exhibit a combination of both behaviors.

Multi-agent reinforcement learning has become an active research area with applications in robotics, game theory, distributed systems, and autonomous coordination.

Get in touch for customized mentorship, research and freelance solutions tailored to your needs.

bottom of page