Ray Train & Ray Tune: Distributed Training and Hyperparameter Tuning
Most machine learning projects start on a single machine, and at some point the training loop either runs out of memory, takes too long on one GPU, or needs hundreds of hyperparameter trials that a single process cannot finish in time. Ray Train and Ray Tune address those two problems directly: the first scales an existing training loop across multiple workers and GPUs, and the second runs many training runs in parallel to search a hyperparameter space. This tutorial walks through both libraries with a coherent PyTorch example, then shows how to combine them so you can tune the hyperparameters of a distributed training job without rewriting your code.
A Short Introduction to Ray and the AI Libraries
Ray is a framework for distributed Python. At its core it provides tasks and actors that run across a cluster, plus a scheduler that places work on available CPUs and GPUs. On top of that core, the Ray AI Libraries provide higher-level building blocks for common machine learning needs:
- Ray Data for distributed data loading and preprocessing.
- Ray Train for distributed model training.
- Ray Tune for hyperparameter search and experiment orchestration.
- Ray Serve for online model serving (covered in a separate tutorial, so we only mention it here for context).
The important property is that the same code runs on your laptop and on a multi-node cluster. You develop locally, then point the same script at a cluster when you need more compute. This tutorial focuses on Ray Train and Ray Tune, which together cover the training and tuning stages of a typical workflow.
Installation
Install Ray with the Train and Tune extras, plus PyTorch:
pip install "ray[train,tune]" torch torchvision
Verify the installation:
python -c "import ray; from ray import train, tune; print(ray.version)"
The examples below were written against Ray 2.x. The API for Ray Train was reworked in the 2.x series, so make sure you are not on a pre-2.7 release where the imports differ.
Ray Train: From Single-Node to Distributed
The starting point: a plain PyTorch loop
Consider a typical single-process training loop for an image classifier on FashionMNIST. There is nothing distributed about it yet.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
def buildmodel():
return nn.Sequential(
nn.Flatten(),
nn.Linear(28 28, 256),
nn.ReLU(),
nn.Linear(256, 128),
nn.ReLU(),
nn.Linear(128, 10),
)
def trainsinglenode(epochs=5, lr=1e-3, batchsize=64):
transform = transforms.ToTensor()
traindata = datasets.FashionMNIST(
root="./data", train=True, download=True, transform=transform
)
loader = DataLoader(traindata, batchsize=batchsize, shuffle=True)
device = "cuda" if torch.cuda.isavailable() else "cpu"
model = buildmodel().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
lossfn = nn.CrossEntropyLoss()
for epoch in range(epochs):
model.train()
for images, labels in loader:
images, labels = images.to(device), labels.to(device)
optimizer.zerograd()
loss = lossfn(model(images), labels)
loss.backward()
optimizer.step()
print(f"epoch {epoch}: last batch loss {loss.item():.4f}")
This works, but it uses one process and one device. To train across several GPUs or several nodes with data-parallelism, we adapt the loop for Ray Train.
Converting the loop with TorchTrainer
Ray Train wraps your training function and runs one copy of it per worker. Each worker handles a shard of the data and synchronizes gradients automatically. The changes are small and localized: