Reinforcement Learning with Gymnasium and Stable-Baselines3
Introduction
Reinforcement Learning (RL) is a paradigm of machine learning where an agent learns to make decisions by interacting with an environment. Unlike supervised learning, RL does not require labeled data. Instead, the agent discovers optimal behavior through trial and error, guided by reward signals.
This tutorial provides a hands-on guide to building RL agents using Gymnasium (the maintained fork of OpenAI Gym) for environment simulation and Stable-Baselines3 (SB3) for state-of-the-art RL algorithm implementations. You will learn to train agents with DQN, PPO, and A2C, create custom environments, monitor training, and evaluate model performance.
Prerequisites
- Python 3.9+
- Basic understanding of neural networks
- Familiarity with PyTorch (helpful but not required)
Install the required packages:
pip install gymnasium "stable-baselines3[extra]" tensorboard numpy matplotlib
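The [extra] option is a pip "extra", not a command-line flag. It pulls in SB3's optional dependencies, including TensorBoard support and extra Gymnasium environments. The quotes keep shells such as zsh from interpreting the square brackets.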
Core RL Concepts
Before writing any code, it is essential to understand the fundamental components of an RL system.
The RL Framework
Agent                          Environment
  |                                 |
  |------------ Action ------------>|
  |                                 |
  |<-------- State, Reward ---------|
- Agent: The learner and decision-maker that selects actions.
- Environment: The world the agent interacts with. It receives actions and returns states and rewards.
- State (Observation): A representation of the current situation in the environment.
- Action: A decision made by the agent that affects the environment.
- Reward: A scalar feedback signal indicating how good the action was.
- Policy: A mapping from states to actions. This is what the agent learns.
- Episode: A complete sequence from the initial state to a terminal state.
- Discount Factor (gamma): A value between 0 and 1 that determines how much future rewards are valued relative to immediate ones.
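To make the role of the discount factor concrete, the short sketch below computes the discounted return of a finite reward sequence. The function name discounted_return and the sample rewards are illustrative choices, not part of Gymnasium or SB3.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite list of rewards.

    Hypothetical helper for illustration; it works backwards through the
    episode so each step needs only one multiply and one add.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=1.0))  # no discounting: 3.0
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
print(discounted_return(rewards, gamma=0.0))  # only the first reward counts: 1.0
```

A gamma close to 1 makes the agent far-sighted, while a gamma close to 0 makes it care almost exclusively about the immediate reward.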
Value Functions
- V(s): The expected cumulative discounted reward (the return) starting from state s and following policy pi thereafter.
- Q(s, a): The expected cumulative discounted reward (the return) starting from state s, taking action a, and following policy pi thereafter.
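Written out, the two definitions above take the following standard form. The notation (r_{t+1} for the reward received after step t, and an infinite-horizon sum) is one common convention and is assumed here rather than taken from this tutorial.

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_0 = s\right]

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_0 = s,\ a_0 = a\right]

V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a)
```

The last line connects the two: the value of a state is the policy-weighted average of its action values.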
Exploration vs. Exploitation
The agent must balance exploring new actions (to discover potentially better strategies) with exploiting known good actions (to maximize reward). This is managed through strategies such as epsilon-greedy action selection (used by DQN) or entropy regularization (available in PPO and A2C).
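As a rough illustration of epsilon-greedy selection, the sketch below picks a random action with probability epsilon and the highest-valued action otherwise. The function name epsilon_greedy and the list of Q-value estimates are assumptions made for this example; SB3 handles exploration internally.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated Q-values.

    Hypothetical helper: with probability epsilon explore (uniform random
    action), otherwise exploit (action with the largest estimate).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


q_estimates = [0.1, 0.9, 0.3]
print(epsilon_greedy(q_estimates, epsilon=0.0))  # always greedy: 1
print(epsilon_greedy(q_estimates, epsilon=1.0))  # always random: 0, 1, or 2
```

In practice epsilon is usually decayed over training, so the agent explores heavily at first and exploits more as its estimates improve.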
Getting Started with Gymnasium
Gymnasium provides standardized interfaces for RL environments.
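The core of that interface is two methods: reset(), which returns an (observation, info) pair, and step(action), which returns (observation, reward, terminated, truncated, info). The sketch below mimics those signatures with a toy class that uses only the standard library. CoinGuessEnv and run_episode are hypothetical names invented for illustration, and the class is not a real Gymnasium environment.

```python
import random


class CoinGuessEnv:
    """Hypothetical toy environment mimicking Gymnasium's reset/step signatures."""

    def __init__(self, max_steps=10):
        self.max_steps = max_steps
        self._rng = random.Random()
        self._steps = 0

    def reset(self, seed=None):
        # Like Gymnasium, reset returns (observation, info).
        if seed is not None:
            self._rng.seed(seed)
        self._steps = 0
        return 0, {}

    def step(self, action):
        # Like Gymnasium, step returns
        # (observation, reward, terminated, truncated, info).
        coin = self._rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self._steps += 1
        terminated = False  # this toy task has no terminal state
        truncated = self._steps >= self.max_steps  # time limit reached
        return coin, reward, terminated, truncated, {}


def run_episode(env, seed=None):
    """Run one episode with a random policy; return (total_reward, steps)."""
    obs, info = env.reset(seed=seed)
    total_reward, steps, done = 0.0, 0, False
    while not done:
        action = random.randint(0, 1)  # random guess
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        steps += 1
        done = terminated or truncated
    return total_reward, steps


print(run_episode(CoinGuessEnv(max_steps=10), seed=0))
```

An episode ends when either terminated (the task itself finished) or truncated (an external limit such as a step cap was hit) is true, which is why the loop checks both.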
Exploring an Environment
import gymnasium as gym
import numpy as np

# Create the CartPole environment
env = gym.make("CartPole-v1", render_mode="human")
print(f"Observation space: {env.observation_space}")
print(f"Action space: {env.action_space}")
print(f"Observation shape: {env.observation_space.shape}")
print(f"Number of actions: {env.action_space.n}")

# Run a random agent
obs, info = env.reset(seed=42)
total_reward = 0
for step in range(500):
    action = env.action_space.sample()  # random action