Reinforcement Learning Tutorial: Gymnasium and Stable-Baselines3

By Ruby Abdullah · tutorial
Tags: Reinforcement Learning, Gymnasium, Stable-Baselines3, DQN, PPO, Python

Reinforcement Learning with Gymnasium and Stable-Baselines3

Table of Contents

  • Introduction
  • Prerequisites
  • Core RL Concepts
  • Getting Started with Gymnasium
  • Deep Q-Network (DQN)
  • Proximal Policy Optimization (PPO)
  • Advantage Actor-Critic (A2C)
  • Creating a Custom Environment
  • Training Loop and Callback Monitoring
  • Model Evaluation
  • TensorBoard Integration
  • Best Practices
  • Conclusion

    Introduction

    Reinforcement Learning (RL) is a paradigm of machine learning where an agent learns to make decisions by interacting with an environment. Unlike supervised learning, RL does not require labeled data. Instead, the agent discovers optimal behavior through trial and error, guided by reward signals.

    This tutorial provides a hands-on guide to building RL agents using Gymnasium (the maintained fork of OpenAI Gym) for environment simulation and Stable-Baselines3 (SB3) for state-of-the-art RL algorithm implementations. You will learn to train agents with DQN, PPO, and A2C, create custom environments, monitor training, and evaluate model performance.


    Prerequisites

    • Python 3.9+
    • Basic understanding of neural networks
    • Familiarity with PyTorch (helpful but not required)

    Install the required packages:

    pip install gymnasium "stable-baselines3[extra]" tensorboard numpy matplotlib
    

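    The [extra] suffix is a pip "extra", not a command-line flag. It installs optional dependencies, including TensorBoard support and extra Gymnasium environments. The quotes keep shells such as zsh from interpreting the square brackets.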
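A quick way to confirm the installation is to ask Python which versions it can see. The sketch below uses only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the install command above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


PACKAGES = ["gymnasium", "stable-baselines3", "tensorboard", "numpy", "matplotlib"]

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED should be installed before continuing.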


    Core RL Concepts

    Before writing any code, it is essential to understand the fundamental components of an RL system.

    The RL Framework

    Agent                         Environment
      |                                |
      |----------- Action ------------>|
      |                                |
      |<------- State, Reward ---------|

    • Agent: The learner and decision-maker that selects actions.
    • Environment: The world the agent interacts with. It receives actions and returns states and rewards.
    • State (Observation): A representation of the current situation in the environment.
    • Action: A decision made by the agent that affects the environment.
    • Reward: A scalar feedback signal indicating how good the action was.
    • Policy: A mapping from states to actions. This is what the agent learns.
    • Episode: A complete sequence from the initial state to a terminal state.
    • Discount Factor (gamma): A value between 0 and 1 that determines how much future rewards are valued relative to immediate ones.
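The terms above can be made concrete with a small, dependency-free sketch. `CorridorEnv`, `run_episode`, and `discounted_return` are hypothetical names invented for this illustration, and the toy environment deliberately uses a simpler interface than Gymnasium's. The example rolls out one episode under a fixed policy and then applies the discount factor to the collected rewards.

```python
class CorridorEnv:
    """Hypothetical toy environment: reach the right end of a corridor."""

    def __init__(self, length=4):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # initial state

    def step(self, action):
        # Action 1 moves right; any other action moves left (bounded at 0).
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)
        done = self.position == self.length
        reward = 1.0 if done else 0.0  # reward only when the goal is reached
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Roll out one episode and return the list of rewards received."""
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state)                   # policy: state -> action
        state, reward, done = env.step(action)   # environment responds
        rewards.append(reward)
        if done:
            break
    return rewards


def discounted_return(rewards, gamma):
    """Compute sum over t of gamma**t * rewards[t], working backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def always_right(state):
    return 1


rewards = run_episode(CorridorEnv(length=4), always_right)
print(rewards)                          # [0.0, 0.0, 0.0, 1.0]
print(discounted_return(rewards, 0.9))  # approximately 0.9**3 = 0.729
```

Because the only reward arrives on the fourth step, it is discounted three times. A gamma closer to 0 would shrink it further, which is how the discount factor makes delayed rewards count for less than immediate ones.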

    Value Functions

    • V(s): The expected discounted cumulative reward (the return) when starting from state s and following policy pi.
    • Q(s, a): The expected discounted cumulative reward when starting from state s, taking action a, and then following policy pi.
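As an illustration of how V and Q relate, here is a minimal sketch on a hypothetical three-state chain (states 0 and 1, plus a terminal state 2) with deterministic transitions. The MDP, the fixed "always move right" policy, and all names are assumptions made for this example. V is obtained by repeatedly applying the one-step backup, and Q is then derived from V.

```python
GAMMA = 0.9
TERMINAL = 2
STATES = (0, 1)   # non-terminal states
ACTIONS = (0, 1)  # 0 = stay, 1 = move right


def transition(state, action):
    """Deterministic dynamics: return (next_state, reward)."""
    next_state = state + 1 if action == 1 else state
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward


def policy(state):
    """Fixed policy pi: always move right."""
    return 1


def evaluate_policy(sweeps=50):
    """Iteratively compute V(s) for the fixed policy."""
    values = {0: 0.0, 1: 0.0, TERMINAL: 0.0}
    for _ in range(sweeps):
        for state in STATES:
            next_state, reward = transition(state, policy(state))
            values[state] = reward + GAMMA * values[next_state]
    return values


def q_value(values, state, action):
    """Q(s, a): take `action` once, then follow the policy afterwards."""
    next_state, reward = transition(state, action)
    return reward + GAMMA * values[next_state]


V = evaluate_policy()
Q = {(s, a): q_value(V, s, a) for s in STATES for a in ACTIONS}

for state in STATES:
    print(f"V({state}) = {V[state]:.2f}")
for (state, action), value in Q.items():
    print(f"Q({state}, {action}) = {value:.2f}")
```

In this example V(s) equals Q(s, policy(s)), while Q(0, 0) is lower than Q(0, 1) because staying put delays the reward by one step.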

    Exploration vs. Exploitation

    The agent must balance exploring new actions (to discover potentially better strategies) with exploiting known good actions (to maximize reward). This is managed through strategies like epsilon-greedy or entropy regularization.
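Epsilon-greedy is the simpler of the two strategies to sketch. The function below is a minimal illustration, not the implementation used by any particular library; the name `epsilon_greedy` and the example Q-values are assumptions. With probability epsilon it picks a uniformly random action (exploration), otherwise the action with the highest estimated value (exploitation).

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index chosen by the epsilon-greedy rule."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda action: q_values[action])


q_estimates = [0.1, 0.7, 0.3]  # hypothetical Q(s, a) estimates for one state
rng = random.Random(0)         # seeded generator for reproducibility

counts = [0, 0, 0]
for _ in range(1000):
    counts[epsilon_greedy(q_estimates, epsilon=0.1, rng=rng)] += 1
print(counts)  # action 1 should dominate, with occasional exploratory picks
```

In practice epsilon is often started high and decayed over training, so the agent explores widely at first and exploits more as its estimates improve.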


    Getting Started with Gymnasium

    Gymnasium provides standardized interfaces for RL environments.

    Exploring an Environment

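    import gymnasium as gym
    import numpy as np

    # Create the CartPole environment
    env = gym.make("CartPole-v1", render_mode="human")

    print(f"Observation space: {env.observation_space}")
    print(f"Action space: {env.action_space}")
    print(f"Observation shape: {env.observation_space.shape}")
    print(f"Number of actions: {env.action_space.n}")

    # Run a random agent
    obs, info = env.reset(seed=42)
    total_reward = 0

    for step in range(500):
        action = env.action_space.sample()  # random action
        # The source is cut off at this point; the rest of the loop is a
        # minimal completion using the standard Gymnasium step API.
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break

    print(f"Total reward: {total_reward}")
    env.close()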
