Complete Ray Serve Tutorial: Scalable ML Model Serving
Ray Serve is a scalable model-serving library built on top of Ray. It lets you serve ML models with autoscaling, request batching, and multi-model composition, which makes it well suited to production ML deployments.
Why Ray Serve?
Ray Serve advantages:
- Framework agnostic: works with any ML framework
- Scalable: scales automatically based on load
- Composable: combine multiple models with ease
- Batching: dynamic request batching via the @serve.batch decorator
- Native Python: a simple, Python-first API
Typical use cases:
- Large-scale model serving
- Multi-model pipelines
- A/B testing
- Real-time inference
- Batch inference
Installation
pip install "ray[serve]"
Verify the installation:
python -c "import ray; from ray import serve; print(ray.__version__)"
Quick Start
1. Basic Deployment
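import ray
import requests
from ray import serve

ray.init()

@serve.deployment
class ModelDeployment:
    def __init__(self):
        # Placeholder standing in for a real model object
        self.model = "simplemodel"

    def __call__(self, request):
        return {"message": f"Processed by {self.model}"}

# Deploy. serve.run replaces the Ray 1.x .deploy() call, which newer
# Ray releases no longer support.
serve.run(ModelDeployment.bind(), route_prefix="/ModelDeployment")

# Test
response = requests.get("http://localhost:8000/ModelDeployment")
print(response.json())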
2. With FastAPI
import ray
from fastapi import FastAPI
from ray import serve

app = FastAPI()

@serve.deployment
@serve.ingress(app)
class MLService:
    def __init__(self):
        self.model = self.load_model()

    def load_model(self):
        # Placeholder: load and return the real model here
        return "mymodel"

    @app.get("/predict")
    def predict(self, text: str):
        # `text` is read from the query string, e.g. /predict?text=hello
        return {"prediction": f"Result for: {text}"}

    @app.get("/health")
    def health(self):
        return {"status": "healthy"}

ray.init()
serve.run(MLService.bind())
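Because predict declares text as a plain str parameter, FastAPI reads it from the query string. The sketch below builds such a URL using only the standard library. The helper name build_predict_url is hypothetical, and the base URL assumes Serve's default HTTP port 8000 on localhost with the application mounted at the root route.

```python
from urllib.parse import urlencode


def build_predict_url(text: str, base_url: str = "http://localhost:8000") -> str:
    """Return the GET URL for /predict with `text` URL-encoded as a query parameter.

    Hypothetical helper; base_url is an assumption (default Serve HTTP port).
    """
    return f"{base_url}/predict?{urlencode({'text': text})}"


url = build_predict_url("hello world")
print(url)
```

Once the service is running, the resulting URL can be passed to requests.get or opened in a browser.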
3. Serving an ML Model
import pickle

import numpy as np
import ray
from ray import serve

@serve.deployment
class SklearnModel:
    def __init__(self, model_path: str):
        # Only unpickle files that come from a trusted source
        with open(model_path, "rb") as f:
            self.model = pickle.load(f)

    async def __call__(self, request):
        # Expects a JSON body such as {"features": [5.1, 3.5, 1.4, 0.2]}
        data = await request.json()
        # Reshape the flat feature list into a single-row 2D array
        features = np.array(data["features"]).reshape(1, -1)
        prediction = self.model.predict(features)
        return {"prediction": prediction.tolist()}

ray.init()
serve.run(SklearnModel.bind(model_path="model.pkl"))
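The deployment above reads a JSON body with a "features" key. As a minimal client-side sketch, the request can be prepared with the standard library alone. The helper name build_prediction_request, the sample feature values, and the URL (default port 8000, root route) are assumptions for illustration, not part of the original service.

```python
import json
from urllib import request as urlrequest


def build_prediction_request(features, url="http://localhost:8000/"):
    """Build (but do not send) a POST request carrying {"features": [...]} as JSON.

    Hypothetical helper; the URL is an assumed default for a local Serve instance.
    """
    body = json.dumps({"features": list(features)}).encode("utf-8")
    return urlrequest.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_prediction_request([5.1, 3.5, 1.4, 0.2])
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

With the deployment running, urlrequest.urlopen(req) would send the request. It is left out here so the snippet runs without a server.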
Deployment Configuration
1. Resource Allocation
from ray import serve

@serve.deployment(
    num_replicas=3,
    ray_actor_options={
        "num_cpus": 2,
        "num_gpus": 1,
        "memory": 4 * 1024 * 1024 * 1024,  # 4 GiB, in bytes
    },
)
class GPUModel:
    def __init__(self):
        import torch

        self.device = torch.device("cuda")
        self.model = self.load_model()

    def load_model(self):
        import torch

        # Small linear layer used as a stand-in for a real model
        model = torch.nn.Linear(10, 2)
        return model.to(self.device)

    async def __call__(self, request):
        import torch

        data = await request.json()
        # Cast to float32 so integer inputs match the layer's weights
        tensor = torch.tensor(data["input"], dtype=torch.float32).to(self.device)
        with torch.no_grad():  # inference only, no gradients needed
            output = self.model(tensor)
        return {"output": output.cpu().tolist()}
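The actor options apply to each replica, so the cluster has to provide the per-replica amounts multiplied by the replica count. A rough sketch of that arithmetic for the configuration above (3 replicas, each with 2 CPUs, 1 GPU and 4 GiB of memory) follows. The ReplicaResources class and total_resources function are hypothetical names used only for this calculation.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # bytes in one gibibyte


@dataclass(frozen=True)
class ReplicaResources:
    """Resources reserved by one replica (hypothetical container for the sketch)."""
    num_cpus: float
    num_gpus: float
    memory_bytes: int


def total_resources(per_replica: ReplicaResources, num_replicas: int) -> ReplicaResources:
    """Multiply the per-replica reservation by the number of replicas."""
    return ReplicaResources(
        num_cpus=per_replica.num_cpus * num_replicas,
        num_gpus=per_replica.num_gpus * num_replicas,
        memory_bytes=per_replica.memory_bytes * num_replicas,
    )


per_replica = ReplicaResources(num_cpus=2, num_gpus=1, memory_bytes=4 * GIB)
total = total_resources(per_replica, num_replicas=3)
print(total.num_cpus, total.num_gpus, total.memory_bytes // GIB)  # 6 3 12
```

If the cluster offers fewer than 3 GPUs, not all three replicas can be scheduled at once.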
2. Autoscaling
from ray import serve
from ray.serve.config import AutoscalingConfig

@serve.deployment(
    autoscaling_config=AutoscalingConfig(
        min_replicas=1,
        max_replicas=10,
        # Renamed to target_ongoing_requests in newer Ray releases
        target_num_ongoing_requests_per_replica=5,
        upscale_delay_s=10,    # seconds to wait before scaling up
        downscale_delay_s=30,  # seconds to wait before scaling down
    )
)
class AutoscaledModel:
    def __init__(self):
        self.model = "autoscaledmodel"
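To build intuition for how these settings interact, the sketch below estimates a replica count from the current load. It is a deliberately simplified model, not Ray Serve's actual implementation: it divides the total number of ongoing requests by the per-replica target, rounds up, and clamps the result to the configured minimum and maximum. It ignores the upscale and downscale delays. The function name desired_replicas is hypothetical.

```python
import math


def desired_replicas(
    total_ongoing_requests: int,
    target_per_replica: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Simplified estimate: ceil(load / target), clamped to [min, max]."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(total_ongoing_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


# Values taken from the configuration above: target 5, min 1, max 10
for load in (0, 12, 50, 200):
    print(load, desired_replicas(load, 5, 1, 10))
```

With a target of 5, a load of 12 concurrent requests suggests 3 replicas, while 200 requests would ask for 40 and is capped at the maximum of 10.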