Machine Learning: What is Reinforcement Learning?

Samul Black
Dec 21, 2023

Updated: Jul 16

This article gives a basic introduction to a subfield of machine learning known as reinforcement learning. Reinforcement Learning (RL) is a type of machine learning where an agent learns by interacting with its environment and receiving rewards or penalties based on its actions. Unlike supervised learning, RL doesn’t rely on labeled data; instead, it learns through trial and error to make better decisions over time. In this tutorial, we’ll cover the basics of RL, its key components, and how it’s used in real-world applications like game-playing, robotics, and self-driving systems.

Reinforcement learning in machine learning

What is Reinforcement Learning?

Reinforcement learning is a branch of Artificial Intelligence in which the model learns through continuous interaction with its environment. Let's say I want to get a cookie from a jar that's on a tall shelf. There isn't one right way to get the cookie. Maybe I will find a ladder or build a complicated system of pulleys. These could all be brilliant or terrible ideas, but if something works, I get the sweet taste of victory and I learn that doing the same thing could get me another cookie in the future. We learn lots of things by trial and error, and this kind of learning is called reinforcement learning.

As a general principle, reinforcements are provided to the model based on how a particular action affects its progress toward the target. If an action takes the model closer to the target, it receives a positive reinforcement; otherwise, it receives a negative one. In other words, the agent learns to make decisions by interacting with an environment to achieve a specific goal. It learns through trial and error by receiving feedback (reinforcements) in the form of rewards (positive reinforcement) or penalties (negative reinforcement) based on its actions.

The key to reinforcement learning is just trial and error, over and over again. For humans, a reward might be a cookie or the joy of winning a board game. But for an AI system, a reward is just a small positive signal that basically tells it: good job, do that again. The fundamental components of reinforcement learning are:

Agent: This is our model, the learner or decision-maker that interacts with the environment in order to learn from it and takes actions based on a policy to maximize cumulative rewards.
Environment: The external system with which the agent interacts. It responds to the actions of the agent and provides feedback in the form of rewards or penalties.
Actions: This represents the choices available to the agent at each step of interaction with the environment.
Rewards: Numeric signals provided by the environment to indicate the desirability of the agent's actions. Positive rewards encourage the agent to repeat similar actions, while negative rewards or penalties discourage undesired behaviour.
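
To make these four components concrete, here is a minimal sketch of the agent-environment loop. The `Corridor` environment, the two policies, and the reward values are all hypothetical and chosen only for illustration; they are not part of the GridWorld project later in this article.

```python
import random


class Corridor:
    """Hypothetical environment: positions 0..length-1, goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Action 1 moves right, anything else moves left; walls clamp the position.
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        done = self.pos == self.length - 1
        reward = 1.0 if done else -0.1  # small penalty per step, reward at the goal
        return self.pos, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode; return (total reward, number of steps taken)."""
    state = env.reset()
    total = 0.0
    for t in range(max_steps):
        action = policy(state)                  # the agent picks an action
        state, reward, done = env.step(action)  # the environment responds
        total += reward
        if done:
            return total, t + 1
    return total, max_steps


def random_policy(state):
    return random.choice([0, 1])


def always_right(state):
    return 1


print("always right:", run_episode(Corridor(), always_right))
print("random      :", run_episode(Corridor(), random_policy))
```

Neither policy learns anything yet. The point is only the shape of the loop: state in, action out, reward and next state back.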

The agent doesn't pause for feedback after every action. It interacts with its environment for a while, whether that's a game board, a virtual maze or a real-life kitchen, and takes many actions until it gets a reward, which we give out when it wins a game or gets that cookie jar from that really tall shelf. Every time the agent wins or succeeds at its task, we look back on the actions it took and figure out which game states were helpful and which ones weren't. During this reflection we're assigning value to those different game states and deciding on a policy for which actions work best. We need values and policies to get anything done in reinforcement learning.

Basic Steps in Reinforcement Learning

An agent makes predictions or performs actions, like moving a tiny bit forward or picking the next best move in a game, and it chooses them based on its current inputs, which we call the state. In supervised learning, after each action we would have a training label that tells our AI whether it did the right thing or not. We usually can't do that with reinforcement learning because we don't know what the right thing actually is until the task is completely done. This difference highlights one of the hardest parts of reinforcement learning, called credit assignment. It's hard to know which actions helped us get the reward and should get the credit, and which actions slowed down our AI. The reinforcement learning process involves:

Exploration and Exploitation: The agent explores the environment initially to understand the consequences of its actions and gradually exploits this knowledge to maximize rewards.
Learning from Feedback: Through continuous interaction, the agent learns to associate actions with outcomes by adjusting its policy to optimize long-term rewards.
Temporal Credit Assignment: The agent attributes credit (or blame) to its actions concerning the obtained rewards, taking into account delayed consequences.
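
One common way to spread credit backwards over time is to compute a discounted return for every step of an episode. The sketch below assumes a discount factor `gamma` (introduced properly in the Q-learning section) and a made-up reward sequence in which only the last step is rewarded; the function name `discounted_returns` is ours.

```python
def discounted_returns(rewards, gamma=0.9):
    """For each step t, return r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical episode: nothing happens for three steps, then the agent is rewarded.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
for t, g in enumerate(discounted_returns(episode_rewards)):
    print(f"step {t}: return {g:.3f}")
```

Earlier actions still receive some credit for the final reward, just less of it the further they are from the payoff.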

Industry Uses of Reinforcement Learning

Google’s DeepMind got some pretty impressive results when they used reinforcement learning to teach virtual AI systems to walk, jump and even duck under obstacles. It looks kind of silly but works pretty well. Reinforcement learning finds applications in various domains:

Game Playing: Teaching agents to play games like chess, Go, or video games.
Robotics: Training robots to perform tasks by interacting with the physical world.
Autonomous Vehicles: Enabling vehicles to learn driving strategies in simulated environments.
Recommendation Systems: Optimizing recommendations based on user interactions.

Popular reinforcement learning algorithms include Q-Learning, Deep Q Networks (DQN), Policy Gradient methods, and Actor-Critic methods. These algorithms, coupled with the exploration-exploitation trade-off, enable agents to learn and make decisions in complex and dynamic environments.

List of Reinforcement Learning Algorithms

In machine learning, Reinforcement Learning (RL) falls somewhere between supervised and unsupervised learning. It is not supervised learning, since it doesn't rely on a set of labeled training data. At the same time, it cannot be classified as unsupervised learning, since we want our reinforcement learning agent to maximise a reward. For the agent to attain its main goal, it must determine the correct set of actions to take in different scenarios. The following are core reinforcement learning concepts and techniques:

Markov decision process (MDP)
Bellman equation
Dynamic programming
Value iteration
Policy iteration
Q-learning.
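
To give a feel for how a few of these fit together, here is a small sketch of value iteration, which repeatedly applies the Bellman optimality backup to a known MDP. The four-state corridor, its rewards, and the names `transition` and `V` are assumptions made for this example only.

```python
import numpy as np

n_states = 4        # states 0..3; state 3 is terminal (hypothetical layout)
actions = (-1, +1)  # move left or move right
gamma = 0.9         # discount factor


def transition(s, a):
    """Deterministic model of the MDP: next state and reward for action a in state s."""
    s_next = min(max(s + a, 0), n_states - 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward


V = np.zeros(n_states)
for sweep in range(100):
    V_new = V.copy()
    for s in range(n_states - 1):  # the terminal state keeps value 0
        # Bellman optimality backup: best one-step reward plus discounted next value.
        V_new[s] = max(r + gamma * V[s2]
                       for s2, r in (transition(s, a) for a in actions))
    delta = np.max(np.abs(V_new - V))
    V = V_new
    if delta < 1e-8:               # stop once a full sweep changes nothing
        break

print(V)
```

Unlike Q-learning, value iteration needs the transition model up front. Q-learning estimates similar quantities from experience alone.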

Python Implementation of Reinforcement Learning Using Q-Learning

In this hands-on tutorial, we'll walk you through building a simple GridWorld environment from scratch using Python. You'll train an agent to navigate a grid and reach a goal using Q-learning. Best of all, the entire project runs right in Google Colab, making it easy to follow, run, and visualize. By the end of this tutorial, you’ll have:

Built a basic GridWorld environment
Implemented the Q-learning algorithm from scratch
Visualized the agent’s training progress and final learned path
Gained a practical understanding of how RL agents learn by trial and error

This tutorial is perfect for beginners looking to understand reinforcement learning concepts through coding—not just theory. Let's dive in and teach an AI agent how to solve a maze, one step at a time!

What is Q-Learning?

Q-learning is a model-free reinforcement learning algorithm. It learns the value of taking a given action in a given state by using the Q-value (short for “quality”) and updates this value through interaction with the environment.

The algorithm uses the Bellman equation to update Q-values. For example, consider a robot navigating a grid to reach a goal. When the robot moves from state s to s′ by taking action a, it receives a reward r. The Q-value for (s, a) is updated using the formula:

Q(s,a) ← Q(s,a) + α [r + γ max_a′ Q(s′,a′) − Q(s,a)]

Here,

α (learning rate): controls how much new information overrides old values
γ (discount factor): determines the importance of future rewards
r: reward from taking action a in state s
s′: new state
a′: an action available in s′; the max picks the value of the best one

If the agent finds a path that yields a high reward, the Q-values along that path are reinforced. Exploration is often managed using strategies like ε-greedy, where the agent randomly explores with probability ε and exploits the best-known action otherwise.
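
Here is a minimal sketch of a single Q-learning update and an ε-greedy choice, written as two small helper functions. The names `q_update` and `epsilon_greedy` and the numbers plugged in are illustrative assumptions, not values from a trained agent.

```python
import random

import numpy as np


def q_update(q_sa, reward, q_next_row, alpha=0.1, gamma=0.9):
    """Return the new Q(s, a) after observing one transition."""
    td_target = reward + gamma * np.max(q_next_row)  # r + gamma * max_a' Q(s', a')
    return q_sa + alpha * (td_target - q_sa)


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return int(np.argmax(q_row))


# Made-up numbers: current estimate 0.5, step penalty -0.01,
# and the best action in the next state is currently valued at 0.8.
next_row = np.array([0.2, 0.8, 0.1, 0.0])
new_q = q_update(q_sa=0.5, reward=-0.01, q_next_row=next_row)
print(round(float(new_q), 3))
print(epsilon_greedy(next_row, epsilon=0.2))
```

With α = 0.1 the estimate only moves a tenth of the way toward the target (0.71 here), which is why many episodes are needed before the values settle.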

Step 1: Define the GridWorld Environment

We’ll create a 4x4 grid where:

The agent starts at (0,0)
The goal is at (3,3)
The agent can move in 4 directions: up, down, left, right
Rewards are:
- +1 for reaching the goal
- -0.01 for every step (to encourage shorter paths)

import numpy as np
import random
import matplotlib.pyplot as plt
from matplotlib import colors

# GridWorld definition
class GridWorld:
    def __init__(self, size=4):
        self.size = size
        self.start = (0, 0)
        self.goal = (size - 1, size - 1)
        self.reset()

    def reset(self):
        self.agent_pos = self.start
        return self.agent_pos

    def step(self, action):
        x, y = self.agent_pos
        if action == 0 and x > 0: x -= 1         # up
        elif action == 1 and x < self.size - 1: x += 1  # down
        elif action == 2 and y > 0: y -= 1       # left
        elif action == 3 and y < self.size - 1: y += 1  # right
        self.agent_pos = (x, y)
        reward = 1 if self.agent_pos == self.goal else -0.01
        done = self.agent_pos == self.goal
        return self.agent_pos, reward, done
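
Before training anything, it can help to poke at the environment by hand. The sketch below uses a condensed copy of the class so that it runs on its own; the `MOVES` table is our own shorthand for the four `if`/`elif` branches and is assumed to behave the same way for actions 0 to 3.

```python
class GridWorld:
    """Condensed copy of the environment above so this snippet runs on its own."""

    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

    def __init__(self, size=4):
        self.size = size
        self.start = (0, 0)
        self.goal = (size - 1, size - 1)
        self.agent_pos = self.start

    def reset(self):
        self.agent_pos = self.start
        return self.agent_pos

    def step(self, action):
        dx, dy = self.MOVES[action]
        x, y = self.agent_pos
        nx, ny = x + dx, y + dy
        # Moves that would leave the grid are ignored; the agent stays put.
        if 0 <= nx < self.size and 0 <= ny < self.size:
            self.agent_pos = (nx, ny)
        done = self.agent_pos == self.goal
        reward = 1 if done else -0.01
        return self.agent_pos, reward, done


env = GridWorld()

# Bumping into the top wall leaves the agent in place but still costs a step.
env.reset()
print(env.step(0))

# One shortest route: three moves down, then three moves right.
env.reset()
total = 0.0
for action in [1, 1, 1, 3, 3, 3]:
    pos, reward, done = env.step(action)
    total += reward
print(pos, done, round(total, 2))
```

Five step penalties plus the goal reward is the best total an episode can earn in this grid, so it is a useful ceiling to keep in mind when reading the training curve.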

Step 2: Q-Learning Algorithm & Training Loop

We'll define:

A Q-table with shape [states x actions] → [4x4x4]
An epsilon-greedy strategy for balancing exploration and exploitation
A training loop over multiple episodes

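# Q-learning setup
env = GridWorld()                            # create the environment before sizing the Q-table
q_table = np.zeros((env.size, env.size, 4))  # one Q-value per (row, column, action)
episodes = 500
alpha = 0.1      # learning rate
gamma = 0.9      # discount factor
epsilon = 0.2    # exploration probability
rewards = []     # total reward collected in each episode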

# Train the agent
for ep in range(episodes):
    state = env.reset()
    total_reward = 0
    done = False

    while not done:
        x, y = state
        if random.random() < epsilon:
            action = random.randint(0, 3)
        else:
            action = np.argmax(q_table[x, y])

        next_state, reward, done = env.step(action)
        nx, ny = next_state
        old_value = q_table[x, y, action]
        future_value = np.max(q_table[nx, ny])
        q_table[x, y, action] = old_value + alpha * (reward + gamma * future_value - old_value)

        state = next_state
        total_reward += reward

    rewards.append(total_reward)
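
Once training finishes, a quick way to inspect what was learned is to print the greedy action for every cell. The sketch below assumes the same action order as `step()` (0 = up, 1 = down, 2 = left, 3 = right) and uses a hand-filled stand-in Q-table so that it runs on its own; `greedy_policy_grid` and `demo_q` are hypothetical names.

```python
import numpy as np

ARROWS = np.array(['↑', '↓', '←', '→'])  # index = action id used by step()


def greedy_policy_grid(q_table, goal):
    """Return a grid of arrows showing the highest-valued action in each cell."""
    best = np.argmax(q_table, axis=2)  # best action per (row, column)
    grid = ARROWS[best]                # fancy indexing returns a new array
    grid[goal] = 'G'
    return grid


# Stand-in Q-table (NOT trained values): prefer "right" everywhere,
# except in the last column where "down" scores higher.
demo_q = np.zeros((4, 4, 4))
demo_q[:, :, 3] = 0.5
demo_q[:, 3, 1] = 1.0

for row in greedy_policy_grid(demo_q, goal=(3, 3)):
    print(' '.join(row))
```

Passing the trained `q_table` and `env.goal` instead of the stand-in values shows the policy the agent actually learned.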

Step 3: Visualize Training Progress

plt.figure(figsize=(10, 4))
plt.plot(rewards)
plt.xlabel("Episode")
plt.ylabel("Total Reward")
plt.title("Training Progress")
plt.grid(True)
plt.show()

Output:

[Figure: learning graph for the Q-learning training process]
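
Per-episode rewards tend to be noisy because of random exploration, so a moving average can make the trend easier to read. This is a sketch under our own assumptions: the `moving_average` helper and the window of 20 are arbitrary choices, and the reward series is synthetic stand-in data rather than output from the training loop.

```python
import numpy as np


def moving_average(values, window=20):
    """Average each run of `window` consecutive values."""
    values = np.asarray(values, dtype=float)
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the window fits entirely.
    return np.convolve(values, kernel, mode="valid")


# Synthetic stand-in for the `rewards` list: an upward trend plus noise.
rng = np.random.default_rng(0)
fake_rewards = np.linspace(-1.0, 0.9, 200) + rng.normal(0.0, 0.2, 200)

smoothed = moving_average(fake_rewards, window=20)
print(len(fake_rewards), len(smoothed))
```

The smoothed series is shorter than the input by `window - 1` points, so when plotting it over the raw curve, start its x-axis at episode `window - 1`.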

Step 4: Visualize the Learned Path

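Let’s extract the agent’s final policy and show the path it follows from start to goal when it always picks the action with the highest learned Q-value. If training went well, this should be one of the shortest paths.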

def get_policy_path():
    env.reset()
    state = env.agent_pos
    path = [state]
    visited = set()
    for _ in range(50):  # max steps
        x, y = state
        if (x, y) == env.goal or (x, y) in visited:
            break
        visited.add((x, y))
        action = np.argmax(q_table[x, y])
        next_state, _, _ = env.step(action)
        path.append(next_state)
        state = next_state
    return path

path = get_policy_path()

def visualize_grid_path(path):
    grid = np.zeros((env.size, env.size))
    for (x, y) in path:
        grid[x][y] = 0.5
    sx, sy = env.start
    gx, gy = env.goal
    grid[sx][sy] = 0.7
    grid[gx][gy] = 1.0

    cmap = colors.ListedColormap(['white', 'lightblue', 'orange'])
    bounds = [0, 0.49, 0.69, 1.0]
    norm = colors.BoundaryNorm(bounds, cmap.N)

    fig, ax = plt.subplots()
    ax.imshow(grid, cmap=cmap, norm=norm)

    for x in range(env.size):
        for y in range(env.size):
            if (x, y) == env.start:
                text = 'Start'
            elif (x, y) == env.goal:
                text = 'Goal'
            elif (x, y) in path:
                text = '→'
            else:
                text = ''
            ax.text(y, x, text, ha='center', va='center', color='black')

    ax.set_xticks(np.arange(-0.5, env.size, 1), minor=True)
    ax.set_yticks(np.arange(-0.5, env.size, 1), minor=True)
    ax.grid(which='minor', color='gray', linestyle='-', linewidth=1)
    ax.tick_params(which='both', bottom=False, left=False, labelbottom=False, labelleft=False)
    ax.set_title("Learned Policy Path from Start to Goal")
    plt.show()

visualize_grid_path(path)

Output:

Conclusion: Reinforcement Learning in Action

Reinforcement Learning is a powerful and intuitive framework that teaches agents to make decisions through interaction and experience. In this tutorial, we explored how Q-learning, one of the simplest yet foundational RL algorithms, works by enabling an agent to navigate a grid-based world using trial and error.

Through building a basic GridWorld environment, training a Q-learning agent, and visualizing the results, you've seen how agents can learn optimal behaviors without prior knowledge—only through rewards and feedback. This hands-on approach not only reinforces theoretical concepts but also shows how easily reinforcement learning can be applied to real-world scenarios, from robotics to game AI.

As you continue your journey, remember that this is just the beginning. More advanced algorithms like Deep Q-Networks (DQN), Policy Gradients, and Actor-Critic methods build on these same ideas. But at its core, reinforcement learning remains about learning from consequences—and empowering machines to learn how to learn.