Reinforcement Learning with Python: A Practical, Hands-On Guide
- Dec 21, 2023
Updated: Apr 21
Reinforcement learning is a powerful machine learning approach where an agent learns to make decisions through interaction and feedback, rather than labeled data. By taking actions and optimizing for rewards, it gradually improves its strategy over time.
In this blog, we explore reinforcement learning in Python through a hands-on approach. We build a simple GridWorld environment, implement the Q-learning algorithm, and train an agent step by step. Along the way, we cover key concepts like exploration vs exploitation and reward-based learning.
By the end, we will have a complete working example that shows how an agent learns the optimal path through experience.

What is Reinforcement Learning?
Reinforcement learning is a branch of Artificial Intelligence in which the model learns through continuous interaction with its environment. Let's say I want to get a cookie from a jar that's on a tall shelf. There isn't one right way to get the cookie. Maybe I will find a ladder or build a complicated system of pulleys. These could all be brilliant or terrible ideas, but if something works, I get the sweet taste of victory and I learn that doing the same thing could get me another cookie in the future. We learn lots of things by trial and error, and this kind of learning is called reinforcement learning.
As a general principle, reinforcements are provided to the model based on how a particular action affects its progress toward the target. If the action takes the model closer to the target, a positive reinforcement is provided; otherwise, a negative reinforcement is provided. Put more formally, reinforcement learning is a type of machine learning where an agent learns to make decisions by interacting with an environment to achieve a specific goal. It learns through trial and error by receiving feedback (reinforcements) in the form of rewards (positive reinforcement) or penalties (negative reinforcement) based on its actions.
The key to reinforcement learning is trial and error, over and over again. For humans, a reward might be a cookie or the joy of winning a board game. For an AI system, a reward is just a small positive signal that basically tells it, "good job, do that again." The fundamental components of reinforcement learning are:
Agent: This is our model, the learner or decision-maker that interacts with the environment in order to learn from it and takes actions based on a policy to maximize cumulative rewards.
Environment: The external system with which the agent interacts. It responds to the actions of the agent and provides feedback in the form of rewards or penalties.
Actions: The choices available to the agent at each step of its interaction with the environment.
Rewards: Numeric signals provided by the environment to indicate the desirability of the agent's actions. Positive rewards encourage the agent to repeat similar actions, while negative rewards or penalties discourage undesired behaviour.
We don't pause to think after every action. The agent interacts with its environment for a while, whether that's a game board, a virtual maze, or a real-life kitchen, and it takes many actions until it gets a reward, which we give out when it wins a game or gets that cookie jar from that really tall shelf. Every time the agent succeeds at its task, we look back on the actions it took and figure out which game states were helpful and which weren't. During this reflection, we assign value to those different game states and decide on a policy for which actions work best. We need values and policies to get anything done in reinforcement learning.
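The interaction loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Corridor` environment, its reward of 1.0 at the last cell, and the `run_episode` helper are hypothetical names invented for this example, not part of any library.

```python
import random


class Corridor:
    """Hypothetical 1-D environment: start at cell 0, goal at the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Action 1 moves right, anything else moves left; walls clamp the position.
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        done = self.pos == self.length - 1
        reward = 1.0 if done else 0.0  # the reward only arrives at the goal
        return self.pos, reward, done


def run_episode(env, policy, max_steps=1000):
    """Run one episode and return a list of (state, action, reward) tuples."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)                       # the agent chooses
        next_state, reward, done = env.step(action)  # the environment responds
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


random.seed(0)
trajectory = run_episode(Corridor(), lambda state: random.choice([0, 1]))
print(len(trajectory), "steps, last reward:", trajectory[-1][2])
```

Here the policy is purely random, so the agent only stumbles onto the reward. Learning means replacing that policy with one informed by the rewards collected so far.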
Basic Steps in Reinforcement Learning
An agent makes predictions or performs actions, like moving a tiny bit forward or picking the next best move in a game, based on its current inputs, which we call the state. In supervised learning, after each action we would have a training label that tells our AI whether it did the right thing or not. We can't do that in reinforcement learning because we often don't know what the right thing was until the task is completely done. This difference highlights one of the hardest parts of reinforcement learning, called credit assignment: it's hard to know which actions helped us get the reward and should get the credit, and which actions slowed our AI down. The reinforcement learning process involves:
1. Exploration vs Exploitation
In the early stages, the agent doesn’t know which actions are good or bad, so it explores different possibilities by trying random or less certain actions. This helps it gather information about the environment. As learning progresses, the agent starts exploiting this knowledge by choosing actions that have previously produced higher rewards. A good balance between exploration and exploitation is crucial. Too much exploration slows learning, while too much exploitation can make the agent miss better strategies.
2. Learning from Feedback
Reinforcement learning is driven by feedback in the form of rewards or penalties. After taking an action, the agent observes the result and receives a reward signal that indicates how good that action was. Over time, it updates its decision-making strategy, known as a policy, to favor actions that lead to higher cumulative rewards. This continuous loop of action, feedback, and adjustment is what enables the agent to improve its performance.
3. Temporal Credit Assignment
In many real-world scenarios, rewards are not immediate. An action taken now might only produce a benefit much later. Temporal credit assignment is the process of figuring out which past actions contributed to future rewards. The agent learns to assign credit (or blame) across a sequence of actions, allowing it to understand long-term consequences rather than just immediate outcomes. This is what makes reinforcement learning powerful for solving complex, multi-step problems.
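One common way to spread credit backward through a sequence of actions is to discount future rewards. The sketch below assumes a discount factor `gamma` between 0 and 1 and a reward that arrives only at the last step. The function name `discounted_returns` and the sample rewards are made up for illustration.

```python
def discounted_returns(rewards, gamma=0.9):
    """For each step t, compute r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical episode: three unrewarded steps, then a reward of 1.0.
for step, value in enumerate(discounted_returns([0.0, 0.0, 0.0, 1.0])):
    print(step, round(value, 3))
```

Earlier steps receive a smaller share of the final reward, since each step back multiplies it by `gamma` once more. Actions far from the payoff still get some credit, just less of it.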
Python Implementation of Reinforcement Learning Using Q-Learning
In this hands-on tutorial, we'll walk you through building a simple GridWorld environment from scratch in Python. You'll train an agent to navigate a grid and reach a goal using Q-learning. Best of all, the entire project runs right in Google Colab, making it easy to follow, run, and visualize. By the end of this tutorial, you’ll have:
Built a basic GridWorld environment
Implemented the Q-learning algorithm from scratch
Visualized the agent’s training progress and final learned path
Gained a practical understanding of how RL agents learn by trial and error
This tutorial is perfect for beginners looking to understand reinforcement learning concepts through coding—not just theory. Let's dive in and teach an AI agent how to solve a maze, one step at a time!
What is Q-Learning?
Q-learning is a model-free reinforcement learning algorithm. It learns the value of taking a given action in a given state by using the Q-value (short for “quality”) and updates this value through interaction with the environment.
The algorithm uses the Bellman equation to update Q-values. For example, consider a robot navigating a grid to reach a goal. When the robot moves from state s to s’ by taking action a, it receives a reward r. The Q-value for (s, a) is updated using the formula:
Q(s, a) ← Q(s, a) + α [r + γ max_a′ Q(s′, a′) − Q(s, a)]
Here,
α (learning rate): controls how much new information overrides old values
γ (discount factor): determines the importance of future rewards
r: reward received for taking action a in state s
s′: the new state reached after taking action a
a′: an action available in s′; the max picks the one with the highest current Q-value
If the agent finds a path that yields a high reward, the Q-values along that path are reinforced. Exploration is often managed using strategies like ε-greedy, where the agent randomly explores with probability ε and exploits the best-known action otherwise.
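To make the update rule and ε-greedy selection concrete, here is a minimal sketch of both as standalone functions. The helper names `q_update` and `epsilon_greedy` and the sample numbers are assumptions chosen for illustration. The defaults α = 0.1 and γ = 0.9 match the values used later in this tutorial.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon; otherwise pick the highest-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_update(q_sa, reward, max_q_next, alpha=0.1, gamma=0.9):
    """Return the new Q(s, a) after observing a reward and the best next-state value."""
    td_target = reward + gamma * max_q_next   # r + gamma * max over a' of Q(s', a')
    return q_sa + alpha * (td_target - q_sa)  # move a fraction alpha toward the target


# Hypothetical numbers: Q(s, a) starts at 0, the step costs -0.01,
# and the best action in the next state is currently valued at 1.0.
new_q = q_update(q_sa=0.0, reward=-0.01, max_q_next=1.0)
print(round(new_q, 3))  # 0.089

# With epsilon = 0 the agent always exploits, so it picks action 1 here.
print(epsilon_greedy([0.0, 0.089, 0.0, 0.0], epsilon=0.0))  # 1
```

The target is −0.01 + 0.9 × 1.0 = 0.89, and the old value of 0 moves 10% of the way toward it, giving 0.089. Repeating such updates over many episodes is what lets value spread backward from the goal.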
Step 1: Define the GridWorld Environment
To understand reinforcement learning in action, we begin by creating a simple GridWorld environment. This is a 4×4 grid where an agent learns to move from a starting position at (0, 0) to a goal at (3, 3) by interacting with its surroundings.
At each step, the agent can move up, down, left, or right. A move that would take it outside the grid leaves it where it is. The objective is to reach the goal as efficiently as possible. To encourage this behavior, the agent receives a reward of +1 upon reaching the goal and a small penalty of -0.01 for every other step, discouraging unnecessary movements.
The environment is implemented as a Python class. The reset() method initializes the agent’s position, while the step() method updates its state based on the selected action and returns the new position, reward, and completion status. This interaction loop forms the foundation for training a reinforcement learning agent.
import numpy as np
import random
import matplotlib.pyplot as plt
from matplotlib import colors

# GridWorld definition
class GridWorld:
    def __init__(self, size=4):
        self.size = size
        self.start = (0, 0)
        self.goal = (size - 1, size - 1)
        self.reset()

    def reset(self):
        # Put the agent back on the start cell and return its position
        self.agent_pos = self.start
        return self.agent_pos

    def step(self, action):
        # Positions are (row, column); a move into a wall leaves the agent in place
        x, y = self.agent_pos
        if action == 0 and x > 0: x -= 1  # up
        elif action == 1 and x < self.size - 1: x += 1  # down
        elif action == 2 and y > 0: y -= 1  # left
        elif action == 3 and y < self.size - 1: y += 1  # right
        self.agent_pos = (x, y)
        reward = 1 if self.agent_pos == self.goal else -0.01
        done = self.agent_pos == self.goal
        return self.agent_pos, reward, done

Step 2: Q-Learning Algorithm & Training Loop
With the environment in place, the next step is to train the agent using the Q-learning algorithm. At the core of this process is the Q-table, which stores the estimated cumulative (discounted) reward for each state-action pair. Since our grid is 4×4 and the agent has four possible actions, the Q-table is initialized with a shape of 4×4×4, starting with all values set to zero.
To ensure the agent learns effectively, we use an epsilon-greedy strategy. This means the agent sometimes explores by choosing random actions and other times exploits its current knowledge by selecting the action with the highest Q-value. The balance between these two behaviors is controlled by the epsilon parameter.
The training process runs over multiple episodes. In each episode, the agent starts from the initial position and continues taking actions until it reaches the goal. After every action, the Q-table is updated using the Q-learning formula, which adjusts the value of the current state-action pair based on the reward received and the estimated future rewards.
# Q-learning setup
env = GridWorld()  # create the environment first: the Q-table shape depends on env.size
q_table = np.zeros((env.size, env.size, 4))  # 4 actions
episodes = 500
alpha = 0.1    # learning rate
gamma = 0.9    # discount factor
epsilon = 0.2  # exploration probability
rewards = []

# Train the agent
for ep in range(episodes):
    state = env.reset()
    total_reward = 0
    done = False
    while not done:
        x, y = state
        # Epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.randint(0, 3)
        else:
            action = int(np.argmax(q_table[x, y]))
        next_state, reward, done = env.step(action)
        nx, ny = next_state
        # Q-learning update
        old_value = q_table[x, y, action]
        future_value = np.max(q_table[nx, ny])
        q_table[x, y, action] = old_value + alpha * (reward + gamma * future_value - old_value)
        state = next_state
        total_reward += reward
    rewards.append(total_reward)

Over time, this iterative process allows the agent to learn the optimal path to the goal by maximizing cumulative rewards. The total reward for each episode is also recorded, making it easier to track how the agent’s performance improves during training.
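A quick way to inspect what a Q-table has learned is to print the greedy action for every cell. The sketch below assumes the same action numbering as step() (0 = up, 1 = down, 2 = left, 3 = right). The helper name `greedy_policy_grid` and the small hand-filled `demo_q` table are hypothetical and stand in for a trained table.

```python
import numpy as np

ACTION_SYMBOLS = ['↑', '↓', '←', '→']  # index matches action ids 0..3


def greedy_policy_grid(q_table):
    """Return one string per grid row showing the highest-valued action in each cell."""
    best_actions = np.argmax(q_table, axis=-1)
    return [''.join(ACTION_SYMBOLS[a] for a in row) for row in best_actions]


# Hand-filled 2x2 example, not the result of training.
demo_q = np.zeros((2, 2, 4))
demo_q[0, 0, 3] = 0.5  # top-left: right looks best
demo_q[0, 1, 1] = 0.9  # top-right: down looks best
demo_q[1, 0, 3] = 0.9  # bottom-left: right looks best

for line in greedy_policy_grid(demo_q):
    print(line)
```

Cells whose Q-values are all still zero, such as the goal cell, show '↑' only because argmax returns the first index on ties. An arrow there does not mean anything was learned.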
Step 3: Visualize Training Progress
Once the training process is complete, it’s important to evaluate how well the agent has learned over time. One of the simplest ways to do this is by visualizing the total reward collected in each episode. This helps in understanding if the agent is improving and converging toward an optimal strategy.
plt.figure(figsize=(10, 4))
plt.plot(rewards)
plt.xlabel("Episode")
plt.ylabel("Total Reward")
plt.title("Training Progress")
plt.grid(True)
plt.show()
By plotting rewards against episodes, we can observe learning trends. In the early stages, the rewards are usually lower due to random exploration. As training progresses, the agent begins to make better decisions, leading to higher and more consistent rewards. A steadily increasing curve generally indicates that the learning process is working as expected.
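Because the agent keeps exploring with probability ε, per-episode rewards stay noisy even after learning. Smoothing them with a moving average can make the trend easier to read. This is a sketch only: the `moving_average` helper, the window of 20 episodes, and the sample numbers are assumptions, not part of the original code.

```python
def moving_average(values, window=20):
    """Return the mean of each run of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averages = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        if i >= window - 1:
            averages.append(running / window)
    return averages


# Hypothetical noisy rewards, smoothed with a window of 2.
print(moving_average([1, 2, 3, 4, 5], window=2))  # [1.5, 2.5, 3.5, 4.5]
```

Passing the result to plt.plot(), for example plt.plot(moving_average(rewards)), draws the smoothed curve. It is window − 1 points shorter than the raw one.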

Step 4: Visualize the Learned Path
After training, the next step is to see what the agent has actually learned. Instead of staring at Q-values like they’re going to reveal their secrets, we extract the agent’s policy and trace the path it follows from the start to the goal.
The get_policy_path() function does exactly this. Starting from the initial position, the agent repeatedly selects the action with the highest Q-value at each state. This represents the best decision it has learned during training. The function continues until the agent reaches the goal or revisits a state, preventing it from getting stuck in loops.
Once the path is generated, we visualize it on the grid. The visualize_grid_path() function highlights the cells visited by the agent, along with clearly marking the start and goal positions. This provides an intuitive view of the learned behavior, making it easy to verify if the agent has discovered the shortest and most efficient route.
def get_policy_path():
    state = env.reset()
    path = [state]
    visited = set()
    for _ in range(50):  # max steps
        x, y = state
        # Stop at the goal, or when a state repeats (the greedy policy is looping)
        if (x, y) == env.goal or (x, y) in visited:
            break
        visited.add((x, y))
        action = int(np.argmax(q_table[x, y]))
        next_state, _, _ = env.step(action)
        path.append(next_state)
        state = next_state
    return path

path = get_policy_path()
def visualize_grid_path(path):
    grid = np.zeros((env.size, env.size))
    for (x, y) in path:
        grid[x][y] = 0.5
    sx, sy = env.start
    gx, gy = env.goal
    grid[sx][sy] = 0.7
    grid[gx][gy] = 1.0

    # white = unvisited, lightblue = path, orange = start and goal
    cmap = colors.ListedColormap(['white', 'lightblue', 'orange'])
    bounds = [0, 0.49, 0.69, 1.0]
    norm = colors.BoundaryNorm(bounds, cmap.N)

    fig, ax = plt.subplots()
    ax.imshow(grid, cmap=cmap, norm=norm)
    for x in range(env.size):
        for y in range(env.size):
            if (x, y) == env.start:
                text = 'Start'
            elif (x, y) == env.goal:
                text = 'Goal'
            elif (x, y) in path:
                text = '→'  # marks a visited cell, not the direction of travel
            else:
                text = ''
            ax.text(y, x, text, ha='center', va='center', color='black')

    ax.set_xticks(np.arange(-0.5, env.size, 1), minor=True)
    ax.set_yticks(np.arange(-0.5, env.size, 1), minor=True)
    ax.grid(which='minor', color='gray', linestyle='-', linewidth=1)
    ax.tick_params(which='both', bottom=False, left=False, labelbottom=False, labelleft=False)
    ax.set_title("Learned Policy Path from Start to Goal")
    plt.show()

visualize_grid_path(path)

A well-trained agent should produce a direct and minimal path from start to goal, showing that it has successfully learned to maximize rewards while minimizing unnecessary steps.
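Beyond eyeballing the plot, the path can be checked programmatically. On a grid with no obstacles, the shortest route between two cells takes exactly as many moves as their Manhattan distance, which is 6 from (0, 0) to (3, 3). The helper below is a sketch under that no-obstacle assumption. The name `is_shortest_path` and the example path are hypothetical.

```python
def is_shortest_path(path, start, goal):
    """Return True if `path` is a valid, minimal-length route from start to goal."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    # Every move must go to a horizontally or vertically adjacent cell.
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:
            return False
    manhattan = abs(goal[0] - start[0]) + abs(goal[1] - start[1])
    return len(path) - 1 == manhattan


# Hand-written example path: right along the top row, then down the last column.
example_path = [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3), (2, 3), (3, 3)]
print(is_shortest_path(example_path, (0, 0), (3, 3)))  # True
```

Calling is_shortest_path(path, env.start, env.goal) on the output of get_policy_path() gives a simple pass/fail check. A False result suggests that the greedy policy got stuck or took a detour, in which case more training episodes may help.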

Conclusion
Reinforcement Learning is a powerful and intuitive framework that teaches agents to make decisions through interaction and experience. In this tutorial, we explored how Q-learning, one of the simplest yet foundational RL algorithms, works by enabling an agent to navigate a grid-based world using trial and error.
Through building a basic GridWorld environment, training a Q-learning agent, and visualizing the results, you've seen how agents can learn optimal behaviors without prior knowledge—only through rewards and feedback. This hands-on approach not only reinforces theoretical concepts but also shows how easily reinforcement learning can be applied to real-world scenarios, from robotics to game AI.
As you continue your journey, remember that this is just the beginning. More advanced algorithms like Deep Q-Networks (DQN), Policy Gradients, and Actor-Critic methods build on these same ideas. But at its core, reinforcement learning remains about learning from consequences—and empowering machines to learn how to learn.





