Ray Train & Ray Tune: Distributed Training and Hyperparameter Tuning

Most machine learning projects start on a single machine, and at some point the training loop either runs out of memory, takes too long on one GPU, or needs hundreds of hyperparameter trials that a single process cannot finish in time. Ray Train and Ray Tune address those two problems directly: the first scales an existing training loop across multiple workers and GPUs, and the second runs many training runs in parallel to search a hyperparameter space. This tutorial walks through both libraries with a coherent PyTorch example, then shows how to combine them so you can tune the hyperparameters of a distributed training job without rewriting your code.

A Short Introduction to Ray and the AI Libraries

Ray is a framework for distributed Python. At its core it provides tasks and actors that run across a cluster, plus a scheduler that places work on available CPUs and GPUs. On top of that core, the Ray AI Libraries provide higher-level building blocks for common machine learning needs:

Ray Data for distributed data loading and preprocessing.
Ray Train for distributed model training.
Ray Tune for hyperparameter search and experiment orchestration.
Ray Serve for online model serving (covered in a separate tutorial, so we only mention it here for context).

The important property is that the same code runs on your laptop and on a multi-node cluster. You develop locally, then point the same script at a cluster when you need more compute. This tutorial focuses on Ray Train and Ray Tune, which together cover the training and tuning stages of a typical workflow.

Installation

Install Ray with the Train and Tune extras, plus PyTorch:

pip install "ray[train,tune]" torch torchvision

Verify the installation:

python -c "import ray; from ray import train, tune; print(ray.version)"

The examples below were written against Ray 2.x. The API for Ray Train was reworked in the 2.x series, so make sure you are not on a pre-2.7 release where the imports differ.

Ray Train: From Single-Node to Distributed

The starting point: a plain PyTorch loop

Consider a typical single-process training loop for an image classifier on FashionMNIST. There is nothing distributed about it yet.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


def buildmodel():

    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(28  28, 256),

        nn.ReLU(),
        nn.Linear(256, 128),
        nn.ReLU(),
        nn.Linear(128, 10),
    )


def trainsinglenode(epochs=5, lr=1e-3, batchsize=64):
    transform = transforms.ToTensor()
    traindata = datasets.FashionMNIST(

        root="./data", train=True, download=True, transform=transform
    )
    loader = DataLoader(traindata, batchsize=batchsize, shuffle=True)

    device = "cuda" if torch.cuda.isavailable() else "cpu"

    model = buildmodel().to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    lossfn = nn.CrossEntropyLoss()


    for epoch in range(epochs):
        model.train()
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zerograd()
            loss = lossfn(model(images), labels)

            loss.backward()
            optimizer.step()
        print(f"epoch {epoch}: last batch loss {loss.item():.4f}")

This works, but it uses one process and one device. To train across several GPUs or several nodes with data-parallelism, we adapt the loop for Ray Train.

Converting the loop with TorchTrainer

Ray Train wraps your training function and runs one copy of it per worker. Each worker handles a shard of the data and synchronizes gradients automatically. The changes are small and localized:

Ray Train & Ray Tune Tutorial: Distributed Training and Hyperparameter Tuning

Ray Train & Ray Tune: Distributed Training and Hyperparameter Tuning

A Short Introduction to Ray and the AI Libraries

Installation

Ray Train: From Single-Node to Distributed

The starting point: a plain PyTorch loop

Converting the loop with TorchTrainer

Related Articles

Kedro Tutorial: Reproducible and Maintainable Data Science Pipelines

ZenML: Modular and Cloud-Agnostic MLOps Pipeline Framework

Apache Kafka for Real-Time ML Tutorial: Streaming Data Pipeline

Complete Vertex AI Tutorial: Google Cloud Unified ML Platform

Related Articles

Kedro Tutorial: Reproducible and Maintainable Data Science Pipelines

Kedro: Pipeline Data Science yang Reproducible dan Mudah Dirawat Sebagian besar proyek data science dimulai dari satu no...

ZenML: Modular and Cloud-Agnostic MLOps Pipeline Framework

ZenML: Framework Pipeline MLOps yang Modular dan Cloud-Agnostic Pendahuluan Membangun model machine learning yang akurat...

Apache Kafka for Real-Time ML Tutorial: Streaming Data Pipeline

Tutorial 13: Apache Kafka untuk Pipeline ML Real-Time Daftar Isi Pendahuluan Prasyarat Memahami Apache Kafka [Menyiapkan...

Complete Vertex AI Tutorial: Google Cloud Unified ML Platform

Tutorial Lengkap Vertex AI: Platform ML Terpadu di Google Cloud Vertex AI adalah platform machine learning terpadu Googl...