Modal: Serverless GPU Cloud for ML Model Deployment
One of the biggest challenges in machine learning is not building the model but deploying it to production. You need to manage servers, configure GPU drivers, set up auto-scaling, and handle monitoring and logging. Modal exists to remove that complexity.
Modal is a serverless cloud platform that lets you run Python code in the cloud with GPU access, without managing any infrastructure. Just write a regular Python function, add a decorator, and Modal handles the rest.
Why Modal?
Comparison with the traditional approach:
Without Modal (Traditional):
- Set up EC2/GCE server with GPU
- Install CUDA, cuDNN, drivers
- Configure Docker, Kubernetes
- Set up load balancer, auto-scaling
- Monitoring, logging, alerting
- Total setup time: days to weeks
With Modal:
- Write a Python function
- Add the @app.function(gpu="A100") decorator
- Deploy with modal deploy
- Total time: minutes
Modal offers pay-per-second pricing, meaning you only pay when your code is actually running. No idle costs.
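To make the pay-per-second model concrete, the sketch below compares billing only for active seconds against keeping a GPU server provisioned all month. The hourly rate is a placeholder chosen for the arithmetic, not a real Modal or cloud price, and both function names are hypothetical.

```python
SECONDS_PER_HOUR = 3600


def per_second_cost(active_seconds: float, hourly_rate: float) -> float:
    """Cost when billed only for the seconds the code actually runs."""
    return active_seconds * hourly_rate / SECONDS_PER_HOUR


def always_on_cost(hours_provisioned: float, hourly_rate: float) -> float:
    """Cost of a server that is billed whether or not it is busy."""
    return hours_provisioned * hourly_rate


if __name__ == "__main__":
    HYPOTHETICAL_RATE = 2.00  # assumed USD per GPU-hour, for illustration only
    active = 10_000 * 1.5     # 10,000 requests at 1.5 seconds each
    print(f"per-second billing: ${per_second_cost(active, HYPOTHETICAL_RATE):.2f}")
    print(f"always-on, 30 days: ${always_on_cost(30 * 24, HYPOTHETICAL_RATE):.2f}")
```

The comparison ignores cold-start time and any other per-request overhead, so treat it as a rough estimate rather than a quote.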
Installation and Authentication
Install Modal
pip install modal
Set Up Authentication
modal setup
This command will open your browser for login and automatically configure the API token. After completion, you can start using Modal from the terminal right away.
Verify the installation:
modal --version
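If you prefer to check from Python, for example at the top of a deployment script, the standard library can report the installed package version. This is a small sketch: the helper name installed_version is made up for illustration, and it assumes the distribution is named modal, as in the pip install command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("modal")
    if found is None:
        print("modal is not installed; run: pip install modal")
    else:
        print(f"modal {found} is installed")
```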
Basic Functions with @app.function
The core concept of Modal is simple: you define regular Python functions and mark them with decorators to run in the cloud.
import modal

app = modal.App("hello-modal")

@app.function()
def hello(name: str) -> str:
    return f"Hello, {name}! This is running on Modal cloud."

@app.local_entrypoint()
def main():
    # Call the function running in the cloud
    result = hello.remote("World")
    print(result)
Run it with:
modal run helloapp.py
The hello function executes in the Modal cloud, not on your local machine. The @app.local_entrypoint() decorator marks main as the entry point, which runs locally and calls hello through .remote().
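The decorator pattern itself can be illustrated without Modal. The toy classes below are not Modal's implementation: ToyApp and ToyFunction are hypothetical names, and the "remote" call simply runs in a worker thread as a stand-in for a cloud container. The sketch only shows how a decorator can swap a plain function for a handle that exposes separate local and remote call paths.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


class ToyFunction:
    """Handle returned by the decorator; wraps the original function."""

    def __init__(self, fn: Callable[..., Any], executor: ThreadPoolExecutor):
        self._fn = fn
        self._executor = executor

    def local(self, *args: Any, **kwargs: Any) -> Any:
        # Run in the caller's thread, like calling the plain function.
        return self._fn(*args, **kwargs)

    def remote(self, *args: Any, **kwargs: Any) -> Any:
        # Stand-in for remote execution: run elsewhere and wait for the result.
        return self._executor.submit(self._fn, *args, **kwargs).result()


class ToyApp:
    """Hypothetical stand-in for an app object that registers functions."""

    def __init__(self, name: str):
        self.name = name
        self._executor = ThreadPoolExecutor(max_workers=2)

    def function(self) -> Callable[[Callable[..., Any]], ToyFunction]:
        def decorator(fn: Callable[..., Any]) -> ToyFunction:
            return ToyFunction(fn, self._executor)
        return decorator


app = ToyApp("toy")


@app.function()
def hello(name: str) -> str:
    return f"Hello, {name}!"


if __name__ == "__main__":
    print(hello.remote("World"))
```

In the real library the remote path ships the call to a container in Modal's cloud; the thread pool here only mimics the "runs somewhere else, returns the result" shape.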
GPU Selection
Modal's key strength is easy GPU access. Just specify the GPU type you want:
import modal

# PyTorch is not part of Modal's default image, so install it for every function in this app
image = modal.Image.debian_slim().pip_install("torch")
app = modal.App("gpu-example", image=image)

# Using NVIDIA T4 (budget-friendly, inference)
@app.function(gpu="T4")
def inference_t4():
    import torch
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
    return "T4 inference complete"

# Using NVIDIA A10G (balanced performance/cost)
@app.function(gpu="A10G")
def train_a10g():
    import torch
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    return "A10G training complete"

# Using NVIDIA A100 (high-performance training)
@app.function(gpu="A100")
def train_a100():
    import torch
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    return "A100 training complete"

# Using NVIDIA H100 (cutting-edge, LLM training)
@app.function(gpu="H100")
def train_h100():
    import torch
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    return "H100 training complete"

# Multiple GPUs: "A100:2" requests two A100s
# (older Modal releases wrote this as modal.gpu.A100(count=2))
@app.function(gpu="A100:2")
def multi_gpu_training():
    import torch
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    return "Multi-GPU training complete"
Choose GPUs based on your needs:
- T4: Inference, small-medium models, most cost-effective
- A10G: Balance between performance and cost, fine-tuning
- A100: Large model training, LLM fine-tuning
- H100: Highest performance, large-scale LLM training
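The guidance above can be captured as a small lookup so that the GPU choice lives in one place. This is a sketch: the workload labels and the pick_gpu helper are hypothetical, and the mapping simply mirrors the list above. The optional count uses the "TYPE:N" string form that Modal's gpu argument accepts.

```python
# Workload labels are hypothetical; the GPU names come from the list above.
GPU_BY_WORKLOAD = {
    "inference": "T4",
    "fine-tuning": "A10G",
    "large-training": "A100",
    "llm-training": "H100",
}


def pick_gpu(workload: str, count: int = 1) -> str:
    """Return a GPU spec string such as "T4" or "A100:2" for a workload."""
    if count < 1:
        raise ValueError("count must be at least 1")
    try:
        gpu = GPU_BY_WORKLOAD[workload]
    except KeyError:
        raise ValueError(f"unknown workload: {workload!r}") from None
    # A single GPU needs no suffix; several use the "TYPE:N" form.
    return gpu if count == 1 else f"{gpu}:{count}"
```

The result could then be passed to the decorator, for example @app.function(gpu=pick_gpu("large-training", count=2)).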
Container Images (modal.Image)
Modal uses containers to run your code. You can customize the environment with modal.Image: