Ray Train & Ray Tune Tutorial: Distributed Training and Hyperparameter Tuning

# Ray Train & Ray Tune: Pelatihan Terdistribusi dan Penyetelan Hiperparameter Sebagian besar proyek machine learning dimulai pada satu mesin, dan pada titik tertentu loop pelatihan kehabisan memori,...

By Ruby Abdullah · · tutorial
RayDistributed TrainingHyperparameter TuningPyTorchMLOpsPython

Ray Train & Ray Tune: Distributed Training and Hyperparameter Tuning

Most machine learning projects start on a single machine, and at some point the training loop either runs out of memory, takes too long on one GPU, or needs hundreds of hyperparameter trials that a single process cannot finish in time. Ray Train and Ray Tune address those two problems directly: the first scales an existing training loop across multiple workers and GPUs, and the second runs many training runs in parallel to search a hyperparameter space. This tutorial walks through both libraries with a coherent PyTorch example, then shows how to combine them so you can tune the hyperparameters of a distributed training job without rewriting your code.

A Short Introduction to Ray and the AI Libraries

Ray is a framework for distributed Python. At its core it provides tasks and actors that run across a cluster, plus a scheduler that places work on available CPUs and GPUs. On top of that core, the Ray AI Libraries provide higher-level building blocks for common machine learning needs:

  • Ray Data for distributed data loading and preprocessing.
  • Ray Train for distributed model training.
  • Ray Tune for hyperparameter search and experiment orchestration.
  • Ray Serve for online model serving (covered in a separate tutorial, so we only mention it here for context).

The important property is that the same code runs on your laptop and on a multi-node cluster. You develop locally, then point the same script at a cluster when you need more compute. This tutorial focuses on Ray Train and Ray Tune, which together cover the training and tuning stages of a typical workflow.

Installation

Install Ray with the Train and Tune extras, plus PyTorch:

pip install "ray[train,tune]" torch torchvision

Verify the installation:

python -c "import ray; from ray import train, tune; print(ray.version)"

The examples below were written against Ray 2.x. The API for Ray Train was reworked in the 2.x series, so make sure you are not on a pre-2.7 release where the imports differ.

Ray Train: From Single-Node to Distributed

The starting point: a plain PyTorch loop

Consider a typical single-process training loop for an image classifier on FashionMNIST. There is nothing distributed about it yet.

import torch

import torch.nn as nn

from torch.utils.data import DataLoader

from torchvision import datasets, transforms

def buildmodel():

return nn.Sequential(

nn.Flatten(),

nn.Linear(28 28, 256),

nn.ReLU(),

nn.Linear(256, 128),

nn.ReLU(),

nn.Linear(128, 10),

)

def trainsinglenode(epochs=5, lr=1e-3, batchsize=64):

transform = transforms.ToTensor()

traindata = datasets.FashionMNIST(

root="./data", train=True, download=True, transform=transform

)

loader = DataLoader(traindata, batchsize=batchsize, shuffle=True)

device = "cuda" if torch.cuda.isavailable() else "cpu"

model = buildmodel().to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=lr)

lossfn = nn.CrossEntropyLoss()

for epoch in range(epochs):

model.train()

for images, labels in loader:

images, labels = images.to(device), labels.to(device)

optimizer.zerograd()

loss = lossfn(model(images), labels)

loss.backward()

optimizer.step()

print(f"epoch {epoch}: last batch loss {loss.item():.4f}")

This works, but it uses one process and one device. To train across several GPUs or several nodes with data-parallelism, we adapt the loop for Ray Train.

Converting the loop with TorchTrainer

Ray Train wraps your training function and runs one copy of it per worker. Each worker handles a shard of the data and synchronizes gradients automatically. The changes are small and localized:

Related Articles

Kedro Tutorial: Reproducible and Maintainable Data Science Pipelines

Kedro: Pipeline Data Science yang Reproducible dan Mudah Dirawat Sebagian besar proyek data science dimulai dari satu no...

ZenML: Modular and Cloud-Agnostic MLOps Pipeline Framework

ZenML: Framework Pipeline MLOps yang Modular dan Cloud-Agnostic Pendahuluan Membangun model machine learning yang akurat...

Apache Kafka for Real-Time ML Tutorial: Streaming Data Pipeline

Tutorial 13: Apache Kafka untuk Pipeline ML Real-Time Daftar Isi Pendahuluan Prasyarat Memahami Apache Kafka [Menyiapkan...

Complete Vertex AI Tutorial: Google Cloud Unified ML Platform

Tutorial Lengkap Vertex AI: Platform ML Terpadu di Google Cloud Vertex AI adalah platform machine learning terpadu Googl...