Modal: Serverless GPU Cloud for ML Model Deployment

By Ruby Abdullah · tutorial
Modal · Serverless · GPU · Cloud Computing · Python

One of the biggest challenges in machine learning is not building the model but deploying it to production. You need to manage servers, configure GPU drivers, set up auto-scaling, and much more. Modal exists to eliminate that complexity.

Modal is a serverless cloud platform that lets you run Python code in the cloud with GPU access, without managing any infrastructure. Just write a regular Python function, add a decorator, and Modal handles the rest.

Why Modal?

Comparison with the traditional approach:

Without Modal (Traditional):
  • Set up EC2/GCE server with GPU
  • Install CUDA, cuDNN, drivers
  • Configure Docker, Kubernetes
  • Set up load balancer, auto-scaling
  • Monitoring, logging, alerting
  • Total setup time: days to weeks

With Modal:
  • Write a Python function
  • Add the @app.function(gpu="A100") decorator
  • Deploy with modal deploy
  • Total time: minutes

Modal offers pay-per-second pricing, meaning you only pay when your code is actually running. No idle costs.
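
To make per-second billing concrete, the arithmetic can be sketched as follows. The hourly rate and the traffic numbers are made-up placeholders, not Modal's actual prices; check the official pricing page for real figures.

```python
def per_second_cost(hourly_rate_usd: float, seconds_running: float) -> float:
    """Cost when billed only for the seconds the code actually runs."""
    return hourly_rate_usd / 3600 * seconds_running


def always_on_cost(hourly_rate_usd: float, hours_provisioned: float) -> float:
    """Cost of a server that is billed whether or not it is busy."""
    return hourly_rate_usd * hours_provisioned


HYPOTHETICAL_RATE = 3.60  # USD per GPU-hour, placeholder value

# Assumed workload: 500 requests in a day, each using the GPU for 2 seconds.
busy_seconds = 500 * 2

serverless = per_second_cost(HYPOTHETICAL_RATE, busy_seconds)
dedicated = always_on_cost(HYPOTHETICAL_RATE, 24)

print(f"per-second billing: ${serverless:.2f}")
print(f"always-on server:   ${dedicated:.2f}")
```

With these placeholder numbers the GPU is busy for 1,000 seconds out of 86,400, so the per-second bill is a small fraction of the always-on bill. The gap shrinks as utilization rises.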

Installation and Authentication

Install Modal

    pip install modal

Set Up Authentication

    modal setup

This command opens your browser for login and configures the API token automatically. Once it completes, you can use Modal from the terminal right away.

Verify the installation:

    modal --version
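
The same check can be done from Python, which is handy inside scripts or notebooks. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("modal")
if version is None:
    print("modal is not installed; run `pip install modal` first")
else:
    print(f"modal {version} is installed")
```

Note that this only confirms the package is installed. It does not check that `modal setup` has been run and a token is configured.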

Basic Functions with @app.function

The core concept of Modal is simple: you define regular Python functions and mark them with decorators to run in the cloud.

    import modal

    app = modal.App("hello-modal")


    @app.function()
    def hello(name: str) -> str:
        return f"Hello, {name}! This is running on Modal cloud."


    @app.local_entrypoint()
    def main():
        # Call the function running in the cloud
        result = hello.remote("World")
        print(result)

Run it with:

    modal run hello_app.py

The hello function executes in the Modal cloud, not on your local machine. The @app.local_entrypoint() decorator marks the function that runs locally as the entry point.
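
To build intuition for the decorator pattern, here is a toy model in plain Python. It is not Modal's implementation: `ToyApp` and `ToyFunction` are invented for illustration, and `remote` simply calls the function in the same process and tags the result, standing in for a call that a real platform would send to a cloud container.

```python
from typing import Callable, Dict, Optional


class ToyFunction:
    """Wraps a plain function and exposes .local() and .remote()."""

    def __init__(self, fn: Callable):
        self._fn = fn

    def local(self, *args, **kwargs):
        # Run the wrapped function in the current process.
        return self._fn(*args, **kwargs)

    def remote(self, *args, **kwargs):
        # A real platform would ship this call to a cloud container.
        # The toy only tags the result so the two paths are distinguishable.
        return f"[remote] {self._fn(*args, **kwargs)}"


class ToyApp:
    """Registry that collects decorated functions and one entry point."""

    def __init__(self, name: str):
        self.name = name
        self.functions: Dict[str, ToyFunction] = {}
        self.entrypoint: Optional[Callable] = None

    def function(self):
        def decorator(fn: Callable) -> ToyFunction:
            wrapped = ToyFunction(fn)
            self.functions[fn.__name__] = wrapped
            return wrapped

        return decorator

    def local_entrypoint(self):
        def decorator(fn: Callable) -> Callable:
            self.entrypoint = fn
            return fn

        return decorator


app = ToyApp("hello-toy")


@app.function()
def hello(name: str) -> str:
    return f"Hello, {name}!"


@app.local_entrypoint()
def main() -> str:
    return hello.remote("World")


print(app.entrypoint())
```

The point of the sketch is the shape of the API: decorating a function replaces it with an object that knows how to dispatch the call, and the entry point is just a registered local function that triggers those calls.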

GPU Selection

Modal's key strength is easy GPU access. Just specify the GPU type you want:

    import modal

    app = modal.App("gpu-example")

    # torch is not part of Modal's default image, so install it explicitly.
    image = modal.Image.debian_slim().pip_install("torch")


    # Using NVIDIA T4 (budget-friendly, inference)
    @app.function(gpu="T4", image=image)
    def inference_t4():
        import torch

        print(f"GPU available: {torch.cuda.is_available()}")
        print(f"GPU name: {torch.cuda.get_device_name(0)}")
        return "T4 inference complete"


    # Using NVIDIA A10G (balanced performance/cost)
    @app.function(gpu="A10G", image=image)
    def train_a10g():
        import torch

        print(f"GPU: {torch.cuda.get_device_name(0)}")
        print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
        return "A10G training complete"


    # Using NVIDIA A100 (high-performance training)
    @app.function(gpu="A100", image=image)
    def train_a100():
        import torch

        print(f"GPU: {torch.cuda.get_device_name(0)}")
        return "A100 training complete"


    # Using NVIDIA H100 (cutting-edge, LLM training)
    @app.function(gpu="H100", image=image)
    def train_h100():
        import torch

        print(f"GPU: {torch.cuda.get_device_name(0)}")
        return "H100 training complete"


    # Multiple GPUs: the "TYPE:COUNT" string requests two A100s
    @app.function(gpu="A100:2", image=image)
    def multi_gpu_training():
        import torch

        print(f"Number of GPUs: {torch.cuda.device_count()}")
        return "Multi-GPU training complete"

Choose GPUs based on your needs:

  • T4: Inference, small-medium models, most cost-effective
  • A10G: Balance between performance and cost, fine-tuning
  • A100: Large model training, LLM fine-tuning
  • H100: Highest performance, large-scale LLM training
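
The guidance above can be encoded as a small lookup. This is only a sketch: the workload labels and the `pick_gpu` helper are hypothetical names chosen for illustration, while the GPU names come from the list above. The `"TYPE:COUNT"` form for multiple GPUs is assumed to follow Modal's string syntax for the `gpu` argument.

```python
# Workload labels are hypothetical; GPU names follow the list above.
GPU_BY_WORKLOAD = {
    "inference": "T4",
    "fine-tuning": "A10G",
    "large-training": "A100",
    "llm-training": "H100",
}


def pick_gpu(workload: str, count: int = 1) -> str:
    """Return a GPU spec string such as "T4" or "A100:2"."""
    if count < 1:
        raise ValueError("count must be at least 1")
    try:
        gpu = GPU_BY_WORKLOAD[workload]
    except KeyError:
        raise ValueError(f"unknown workload: {workload!r}") from None
    # A single GPU needs no suffix; several GPUs use the "TYPE:COUNT" form.
    return gpu if count == 1 else f"{gpu}:{count}"


print(pick_gpu("inference"))
print(pick_gpu("large-training", count=2))
```

The returned string could then be passed as the `gpu` argument of `@app.function`, which keeps hardware choices in one place instead of scattered across decorators.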

Container Images (modal.Image)

Modal runs your code in containers. You can customize the environment with modal.Image.
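
As a minimal sketch of what such a customization might look like, the example below builds an image from Modal's slim Debian base and attaches it to a function. The app name, the function name `check_environment`, and the package list are assumptions for illustration; pin the versions your own model needs.

```python
import modal

# Start from Modal's slim Debian base image and layer dependencies on top.
# The system and Python packages listed here are illustrative choices.
image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("git")
    .pip_install("torch", "transformers")
)

app = modal.App("image-example")  # hypothetical app name


@app.function(image=image, gpu="T4")
def check_environment() -> str:
    # Import inside the function: torch is installed in the container image,
    # not necessarily on the local machine that launches the app.
    import torch

    return f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}"


@app.local_entrypoint()
def main():
    print(check_environment.remote())
```

Saved to a file, this could be launched with `modal run` in the same way as the earlier hello example. Importing heavy libraries inside the function body keeps the local entry point runnable even when those libraries are not installed locally.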
