SkyPilot: Running ML and AI Workloads on Any Cloud
Training and serving machine learning models often means fighting two battles at once: finding GPUs that are actually available, and avoiding surprise cloud bills. SkyPilot is an open-source framework that sits above AWS, GCP, Azure, Kubernetes, and your own machines, letting you describe a job once and run it wherever capacity is available at the lowest price. This tutorial walks through the concepts, the YAML format, the CLI, managed spot jobs, and serving, with practical examples you can adapt.
What SkyPilot Is and the Problems It Solves
SkyPilot is a layer that provisions and manages compute on whichever backend you choose, then runs your workload on it. You write a declarative description of what you need (an accelerator, some CPUs, a setup script, a run command) and SkyPilot figures out where to place it.
It addresses a handful of recurring pain points in ML infrastructure:
- GPU scarcity. A specific GPU type is frequently out of capacity in one region while available in another. SkyPilot searches across regions and clouds to find capacity instead of leaving you to retry manually.
- Cost. The same accelerator can vary widely in price between providers and regions. SkyPilot's optimizer compares options and picks the cheapest one that satisfies your constraints.
- Vendor lock-in. Teams often standardize on one provider not because it is best, but because rewriting infrastructure is painful. SkyPilot uses the same task definition everywhere, so moving a job from AWS to GCP is a flag change, not a rewrite.
- Manual cluster management. Spinning up VMs, installing drivers, syncing code, and remembering to tear everything down is tedious and error-prone. SkyPilot automates provisioning, code sync, and — importantly — auto-stopping idle clusters so you stop paying for forgotten machines.
It is not a replacement for your cloud account; it uses your existing credentials and your own quota. Think of it as an orchestration layer rather than a managed hosting service.
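The placement idea described above (look across clouds and regions, skip anything without capacity, pick the cheapest match) can be sketched in a few lines of Python. This is a toy model, not SkyPilot's optimizer: the `Offer` class, the `pick_cheapest` function, and every price and availability flag in `CATALOG` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offer:
    """One place a job could run. All values below are made up."""
    cloud: str
    region: str
    accelerator: str
    hourly_usd: float
    available: bool


def pick_cheapest(offers: list[Offer], accelerator: str,
                  enabled_clouds: set[str]) -> Optional[Offer]:
    """Return the cheapest available offer on an enabled cloud, or None."""
    candidates = [
        o for o in offers
        if o.accelerator == accelerator
        and o.available
        and o.cloud in enabled_clouds
    ]
    return min(candidates, key=lambda o: o.hourly_usd, default=None)


# Hypothetical catalog: prices and availability are placeholders, not quotes.
CATALOG = [
    Offer("aws", "us-east-1", "A100", 4.10, available=False),
    Offer("aws", "us-west-2", "A100", 4.10, available=True),
    Offer("gcp", "us-central1", "A100", 3.67, available=True),
    Offer("azure", "eastus", "A100", 3.40, available=True),
]

choice = pick_cheapest(CATALOG, "A100", enabled_clouds={"aws", "gcp"})
print(choice)
```

In this made-up catalog the Azure offer has the lowest price, but Azure is not in the enabled set, so the GCP offer is selected. The out-of-capacity AWS region is skipped in the same way.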
How It Works
At a high level, when you launch a task SkyPilot picks a cloud and region that satisfy your constraints, provisions the cluster, syncs your code, runs your setup commands once, and then runs your run command.
The cluster is a normal set of VMs (or Kubernetes pods) in your account. You can SSH into it, run more commands on it, stop it, or tear it down at any time.
Installation and Cloud Setup
SkyPilot installs with pip. You select which backends to enable through extras:
pip install "skypilot[aws,gcp,kubernetes]"
Add or remove extras to match your environment — for example skypilot[azure], skypilot[lambda], or skypilot[all]. It is good practice to install into a dedicated virtual environment:
python -m venv .venv
source .venv/bin/activate
pip install "skypilot[aws,gcp,kubernetes]"
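If you script the installation, a small helper can confirm that the interpreter is running inside a virtual environment and assemble the pip arguments for the extras you want. This is a minimal sketch using only the standard library. The names `in_virtualenv` and `pip_install_args` are hypothetical, and the script only prints the command rather than running it.

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter is running inside a venv."""
    return sys.prefix != sys.base_prefix


def pip_install_args(backends):
    """Build the pip arguments for the chosen SkyPilot extras."""
    extras = ",".join(backends)
    return [sys.executable, "-m", "pip", "install", f"skypilot[{extras}]"]


print("inside a virtual environment:", in_virtualenv())
print(" ".join(pip_install_args(["aws", "gcp", "kubernetes"])))
```

Invoking pip as `python -m pip` through `sys.executable` ensures the package lands in the same environment as the interpreter running the script.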
Next, configure credentials for each cloud you intend to use. SkyPilot relies on each provider's standard credential mechanism, so if your CLI already works, SkyPilot usually does too.
# AWS: configure access keys or SSO
aws configure
# GCP: authenticate and set a project
gcloud auth login
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID
# Kubernetes: confirm kubectl points at a working kubeconfig context
kubectl config current-context
Finally, verify what SkyPilot can see with sky check:
sky check
This reports each cloud as enabled or disabled and explains what is missing if a cloud is not ready (for example, missing credentials or insufficient permissions). The optimizer considers only enabled clouds.
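If sky check reports a cloud as disabled, one quick thing to rule out is that the provider's CLI is missing from PATH. The sketch below checks for the three CLIs used in the credential steps above. The `PROVIDER_CLIS` mapping and the `missing_clis` helper are hypothetical names, and the check only confirms that a binary exists, not that its credentials are valid, so sky check remains the authoritative test.

```python
import shutil

# Provider CLIs used in the credential steps, keyed by backend name.
PROVIDER_CLIS = {
    "aws": "aws",
    "gcp": "gcloud",
    "kubernetes": "kubectl",
}


def missing_clis(backends, which=shutil.which):
    """Return the backends whose CLI binary is not found on PATH."""
    return [b for b in backends if which(PROVIDER_CLIS[b]) is None]


missing = missing_clis(PROVIDER_CLIS)
if missing:
    print("CLI not found on PATH for:", ", ".join(missing))
else:
    print("All provider CLIs found on PATH.")
```

The `which` parameter defaults to `shutil.which` and is injectable so that the function can be exercised without the real binaries installed.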