Modal: Serverless GPU Cloud for ML Model Deployment

One of the biggest challenges in machine learning is often not building the model but deploying it to production. You need to manage servers, configure GPU drivers, set up auto-scaling, and much more. Modal exists to eliminate that complexity.

Modal is a serverless cloud platform that lets you run Python code in the cloud with GPU access, without managing any infrastructure. Just write a regular Python function, add a decorator, and Modal handles the rest.

Comparison with the traditional approach:

Without Modal (Traditional):

Set up EC2/GCE server with GPU
Install CUDA, cuDNN, drivers
Configure Docker, Kubernetes
Set up load balancer, auto-scaling
Monitoring, logging, alerting
Total setup time: days to weeks

With Modal:

Write a Python function
Add the @app.function(gpu="A100") decorator
Deploy with modal deploy
Total time: minutes

Modal offers pay-per-second pricing, meaning you only pay when your code is actually running. No idle costs.
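To make the per-second billing model concrete, here is a rough arithmetic sketch. The hourly rate is a made-up placeholder, not Modal's actual price, and the traffic numbers are illustrative assumptions.

```python
# Hypothetical rate for illustration only; NOT Modal's real pricing.
HYPOTHETICAL_GPU_RATE_PER_HOUR = 3.60  # USD per GPU-hour (placeholder)


def per_second_cost(runtime_seconds: float, rate_per_hour: float) -> float:
    """Cost when billed only for the seconds the code actually runs."""
    return runtime_seconds * rate_per_hour / 3600


def always_on_cost(hours_provisioned: float, rate_per_hour: float) -> float:
    """Cost of a server billed for every provisioned hour, busy or idle."""
    return hours_provisioned * rate_per_hour


if __name__ == "__main__":
    # Assumed workload: 200 requests per day, each using the GPU for 3 seconds.
    busy_seconds = 200 * 3
    serverless = per_second_cost(busy_seconds, HYPOTHETICAL_GPU_RATE_PER_HOUR)
    dedicated = always_on_cost(24, HYPOTHETICAL_GPU_RATE_PER_HOUR)
    print(f"per-second billing: ${serverless:.2f}/day")
    print(f"always-on server:   ${dedicated:.2f}/day")
```

The gap narrows as utilization rises, so the comparison is most favorable for bursty or low-traffic workloads.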

Installation and Authentication

Install Modal

pip install modal

Setup Authentication

modal setup

This command will open your browser for login and automatically configure the API token. After completion, you can start using Modal from the terminal right away.

Verify the installation:

modal --version
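The same check can be done from Python, which is handy inside a notebook or a CI script. This is a small sketch using only the standard library; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("modal")
    if found is None:
        print("modal is not installed; run `pip install modal` first")
    else:
        print(f"modal {found} is installed")
```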

Basic Functions with @app.function

The core concept of Modal is simple: you define regular Python functions and mark them with decorators to run in the cloud.

import modal

app = modal.App("hello-modal")

@app.function()
def hello(name: str) -> str:
    return f"Hello, {name}! This is running on Modal cloud."

@app.local_entrypoint()
def main():
    # Call the function running in the cloud
    result = hello.remote("World")
    print(result)

Run it with:

modal run helloapp.py

The hello function executes in the Modal cloud, not on your local machine. The @app.local_entrypoint() decorator marks main as the entry point, which runs locally and calls into the cloud.

GPU Selection

Modal's key strength is easy GPU access. Just specify the GPU type you want:

import modal

app = modal.App("gpu-example")

# Using NVIDIA T4 (budget-friendly, inference)
@app.function(gpu="T4")
def inference_t4():
    # torch must be installed in the function's container image
    import torch
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
    return "T4 inference complete"

# Using NVIDIA A10G (balanced performance/cost)
@app.function(gpu="A10G")
def train_a10g():
    import torch
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    return "A10G training complete"

# Using NVIDIA A100 (high-performance training)
@app.function(gpu="A100")
def train_a100():
    import torch
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    return "A100 training complete"

# Using NVIDIA H100 (cutting-edge, LLM training)
@app.function(gpu="H100")
def train_h100():
    import torch
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    return "H100 training complete"

# Multiple GPUs: append ":<count>" to the GPU type
# (older Modal releases also accepted modal.gpu.A100(count=2))
@app.function(gpu="A100:2")
def multi_gpu_training():
    import torch
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    return "Multi-GPU training complete"

Choose GPUs based on your needs:

T4: Inference, small-medium models, most cost-effective
A10G: Balance between performance and cost, fine-tuning
A100: Large model training, LLM fine-tuning
H100: Highest performance, large-scale LLM training
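The guidance above can be captured in a small lookup so the GPU choice lives in one place. This is only a sketch: the workload labels and the `pick_gpu` helper are hypothetical, while the GPU names come from the list above.

```python
# Workload labels are hypothetical; the GPU names come from the list above.
GPU_BY_WORKLOAD = {
    "inference": "T4",
    "fine-tuning": "A10G",
    "large-model-training": "A100",
    "llm-training": "H100",
}


def pick_gpu(workload: str) -> str:
    """Return the GPU type string to pass as `gpu=` for a workload label."""
    try:
        return GPU_BY_WORKLOAD[workload]
    except KeyError:
        known = ", ".join(sorted(GPU_BY_WORKLOAD))
        raise ValueError(
            f"unknown workload {workload!r}; expected one of: {known}"
        ) from None


if __name__ == "__main__":
    print(pick_gpu("inference"))
```

The returned string can then be passed to the decorator, for example `@app.function(gpu=pick_gpu("fine-tuning"))`.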

Container Images (modal.Image)

Modal runs your code in containers. You can customize the container environment, such as system packages and Python dependencies, with modal.Image.
