Reinforcement Learning with Gymnasium and Stable-Baselines3

Table of Contents

Introduction
Prerequisites
Core RL Concepts
Getting Started with Gymnasium
Deep Q-Network (DQN)
Proximal Policy Optimization (PPO)
Advantage Actor-Critic (A2C)
Creating a Custom Environment
Training Loop and Callback Monitoring
Model Evaluation
TensorBoard Integration
Best Practices
Conclusion

Introduction

Reinforcement Learning (RL) is a paradigm of machine learning where an agent learns to make decisions by interacting with an environment. Unlike supervised learning, RL does not require labeled data. Instead, the agent discovers optimal behavior through trial and error, guided by reward signals.

This tutorial provides a hands-on guide to building RL agents using Gymnasium (the maintained fork of OpenAI Gym) for environment simulation and Stable-Baselines3 (SB3) for state-of-the-art RL algorithm implementations. You will learn to train agents with DQN, PPO, and A2C, create custom environments, monitor training, and evaluate model performance.

Prerequisites

Python 3.9+
Basic understanding of neural networks
Familiarity with PyTorch (helpful but not required)

Install the required packages:

pip install gymnasium "stable-baselines3[extra]" tensorboard numpy matplotlib

The [extra] option installs SB3's optional dependencies, including TensorBoard support and additional Gymnasium environments. The quotes keep shells such as zsh from interpreting the square brackets.
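To confirm the installation, a short check like the one below can help. It is a minimal sketch that uses only the standard library; the helper name installed_versions is hypothetical, and the distribution names are assumed to match the ones passed to pip above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


required = ["gymnasium", "stable-baselines3", "tensorboard", "numpy", "matplotlib"]
for name, ver in installed_versions(required).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED should be installed before continuing.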

Core RL Concepts

Before writing any code, it is essential to understand the fundamental components of an RL system.

The RL Framework

Agent ---- Action ----> Environment
  ^                          |
  |                          |
  +----- State, Reward ------+

Agent: The learner and decision-maker that selects actions.
Environment: The world the agent interacts with. It receives actions and returns states and rewards.
State (Observation): A representation of the current situation in the environment.
Action: A decision made by the agent that affects the environment.
Reward: A scalar feedback signal indicating how good the action was.
Policy: A mapping from states to actions. This is what the agent learns.
Episode: A complete sequence from the initial state to a terminal state.
Discount Factor (gamma): A value between 0 and 1 that determines how much future rewards are valued relative to immediate ones.
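The effect of the discount factor is easiest to see with numbers. The sketch below computes the discounted sum of rewards for one episode; the function name discounted_return and the three-step reward list are illustrative assumptions, not part of Gymnasium or SB3.

```python
def discounted_return(rewards, gamma):
    """Return the sum over t of gamma**t * rewards[t]."""
    total = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]  # hypothetical three-step episode

print(discounted_return(rewards, gamma=1.0))  # 3.0  (future counts fully)
print(discounted_return(rewards, gamma=0.5))  # 1.75 (1 + 0.5 + 0.25)
print(discounted_return(rewards, gamma=0.0))  # 1.0  (only the first reward)
```

A gamma close to 1 makes the agent far-sighted, while a gamma close to 0 makes it care almost only about the immediate reward.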

Value Functions

V(s): The expected discounted cumulative reward (the return) when starting from state s and following policy pi.
Q(s, a): The expected discounted cumulative reward (the return) when starting from state s, taking action a, and following policy pi afterwards.
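The two functions are linked: V(s) is the average of Q(s, a) weighted by how likely the policy is to pick each action in s. The sketch below illustrates this for a single state; the Q-values and action probabilities are made-up numbers chosen for the example.

```python
# Hypothetical Q-values for one state with three available actions.
q_values = [1.0, 2.0, 4.0]

# A stochastic policy: probability of choosing each action in this state.
policy_probs = [0.25, 0.25, 0.5]

# V(s) = sum over actions a of pi(a | s) * Q(s, a)
v_stochastic = sum(p * q for p, q in zip(policy_probs, q_values))

# A greedy policy puts all probability on the best action,
# so its state value is simply the largest Q-value.
v_greedy = max(q_values)

print(v_stochastic)  # 2.75
print(v_greedy)      # 4.0
```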

Exploration vs. Exploitation

The agent must balance exploring new actions (to discover potentially better strategies) with exploiting known good actions (to maximize reward). Common strategies include epsilon-greedy action selection, which DQN uses, and entropy regularization, which policy-gradient methods such as PPO and A2C use.
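Epsilon-greedy selection can be sketched in a few lines. The helper below is an illustration only, not SB3's internal implementation; the function name, the Q-value list, and the seed are assumptions made for the example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # exploit: index of the largest Q-value (first one on ties)
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]  # hypothetical estimates for three actions

# With epsilon = 0 the agent never explores and always picks action 1.
print(epsilon_greedy(q_values, epsilon=0.0, rng=rng))  # 1

# With epsilon = 0.1 most choices are greedy, with occasional random ones.
counts = [0, 0, 0]
for _ in range(1000):
    counts[epsilon_greedy(q_values, epsilon=0.1, rng=rng)] += 1
print(counts)
```

In practice epsilon usually starts high and is reduced over training, so the agent explores early and exploits later.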

Getting Started with Gymnasium

Gymnasium provides a standard interface for RL environments: reset() starts a new episode, and step(action) advances it by one time step.
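To show the shape of that interface without any dependencies, here is a toy stand-in. CoinFlipEnv is a hypothetical class written for this illustration; it does not subclass gymnasium.Env, and it only mimics the return values of reset() and step().

```python
import random


class CoinFlipEnv:
    """Toy environment that mimics the shape of the Gymnasium API."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self._rng = random.Random()
        self._steps = 0
        self._coin = 0

    def reset(self, seed=None):
        # reset() starts a new episode and returns (observation, info).
        if seed is not None:
            self._rng = random.Random(seed)
        self._steps = 0
        self._coin = self._rng.randint(0, 1)
        return self._coin, {}

    def step(self, action):
        # step() returns (observation, reward, terminated, truncated, info).
        reward = 1.0 if action == self._coin else 0.0
        self._steps += 1
        self._coin = self._rng.randint(0, 1)
        terminated = False  # this toy task has no natural terminal state
        truncated = self._steps >= self.max_steps  # time limit reached
        return self._coin, reward, terminated, truncated, {}


env = CoinFlipEnv(max_steps=5)
obs, info = env.reset(seed=0)
total_reward = 0.0
done = False
while not done:
    action = obs  # the observation reveals the coin, so echoing it always wins
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(total_reward)  # 5.0
```

The distinction between terminated (the task itself ended) and truncated (an external limit such as a step cap cut the episode short) is the main thing to carry over to real environments.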

Exploring an Environment

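import gymnasium as gym
import numpy as np

# Create the CartPole environment
env = gym.make("CartPole-v1", render_mode="human")

# Inspect the observation and action spaces
print(f"Observation space: {env.observation_space}")
print(f"Action space: {env.action_space}")

print(f"Observation shape: {env.observation_space.shape}")
print(f"Number of actions: {env.action_space.n}")

# Run a random agent
obs, info = env.reset(seed=42)
total_reward = 0

for step in range(500):
    action = env.action_space.sample()  # random action
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward

    # Stop when the pole falls (terminated) or the time limit is hit (truncated)
    if terminated or truncated:
        break

print(f"Total reward: {total_reward}")
env.close()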
