Complete Kubeflow Tutorial: MLOps on Kubernetes
Kubeflow is an open-source platform for deploying, managing, and scaling machine learning workflows on Kubernetes. It provides a complete MLOps toolkit: pipelines, model serving, notebooks, and experiment tracking.
Why Kubeflow?
Kubeflow advantages:
- Kubernetes native: builds on the scalability and reliability of K8s
- End-to-end MLOps: from experimentation to production
- Portable: runs on any Kubernetes cluster
- Composable: use only the components you need
- Open source: an active community and a broad ecosystem
Typical use cases:
- ML pipeline orchestration
- Distributed training
- Model serving at scale
- Experiment tracking
- Feature engineering
Installation
1. Prerequisites
# Install kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl && sudo mv kubectl /usr/local/bin/

# Install kustomize
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
sudo mv kustomize /usr/local/bin/
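Before moving on, it can help to confirm that both tools actually ended up on the PATH. The following is a minimal sketch using only the Python standard library; the helper name `missing_tools` is hypothetical and not part of any Kubeflow tooling.

```python
import shutil

# The two command-line tools installed in the step above.
REQUIRED_TOOLS = ("kubectl", "kustomize")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("kubectl and kustomize are both available.")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default `shutil.which` is enough.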
2. Install Kubeflow
# Clone manifests
git clone https://github.com/kubeflow/manifests.git
cd manifests

# Install with kustomize; retry until every resource applies cleanly
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying..."
  sleep 10
done

# Check the installation
kubectl get pods -n kubeflow
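With dozens of pods in the namespace, reading the `kubectl get pods` table by eye is error-prone. As a rough sketch, the same check can be done on the JSON form of that output. The function names `pods_not_ready` and `fetch_pods` are hypothetical, and the readiness rule (phase `Running` with every container ready, or phase `Succeeded`) is a simplifying assumption rather than an official Kubeflow health check.

```python
import json
import subprocess


def pods_not_ready(pod_list):
    """Return names of pods that are neither fully ready nor completed.

    `pod_list` is the parsed output of `kubectl get pods -o json`.
    """
    not_ready = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            continue  # completed one-off pods are fine
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if phase != "Running" or not all_ready:
            not_ready.append(pod["metadata"]["name"])
    return not_ready


def fetch_pods(namespace="kubeflow"):
    """Call kubectl and parse its JSON output (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `pods_not_ready(fetch_pods())` against a live cluster returns an empty list once everything in the `kubeflow` namespace has come up.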
3. Access the Dashboard
# Port forward
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

# Open http://localhost:8080
# Default credentials: user@example.com / 12341234
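The port-forward can take a moment to start serving traffic, so a script that talks to the dashboard may want to wait for it first. Below is a minimal standard-library sketch; `dashboard_responds` and `wait_until` are hypothetical helper names, and treating any HTTP response (including an error status or a login redirect) as "up" is an assumption made for simplicity.

```python
import time
import urllib.error
import urllib.request


def dashboard_responds(url="http://localhost:8080", timeout=3.0):
    """Return True if the URL answers with any HTTP response at all."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True  # the server answered, even if with an error status
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, DNS failure, ...


def wait_until(probe, attempts=30, delay=2.0, sleep=time.sleep):
    """Call `probe` up to `attempts` times; return True once it succeeds."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

With the port-forward running in another terminal, `wait_until(dashboard_responds)` returns `True` as soon as the gateway starts answering, or `False` after roughly a minute of failed attempts.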
Kubeflow Pipelines
1. Basic Pipeline
from kfp import compiler, dsl
from kfp.dsl import Dataset, Input, Model, Output


# Each component runs in its own container, so files are passed between steps
# as KFP artifacts (Input/Output) rather than as local /tmp paths.
# Reading a gs:// path with pandas additionally requires gcsfs.
@dsl.component(packages_to_install=["pandas", "gcsfs"])
def preprocess_data(data_path: str, output_data: Output[Dataset]):
    import pandas as pd

    df = pd.read_csv(data_path)
    df = df.dropna()
    df.to_csv(output_data.path, index=False)


@dsl.component(packages_to_install=["pandas", "scikit-learn"])
def train_model(data: Input[Dataset], epochs: int, model: Output[Model]):
    import pickle

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv(data.path)
    X = df.drop("target", axis=1)
    y = df["target"]

    # Note: epochs is accepted as a parameter, but a random forest does not use it.
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X, y)

    with open(model.path, "wb") as f:
        pickle.dump(clf, f)


@dsl.component(packages_to_install=["pandas", "scikit-learn"])
def evaluate_model(model: Input[Model], test_data: Input[Dataset]) -> float:
    import pickle

    import pandas as pd
    from sklearn.metrics import accuracy_score

    with open(model.path, "rb") as f:
        clf = pickle.load(f)

    df = pd.read_csv(test_data.path)
    X = df.drop("target", axis=1)
    y = df["target"]

    predictions = clf.predict(X)
    return float(accuracy_score(y, predictions))


@dsl.pipeline(name="ML Training Pipeline")
def ml_pipeline(data_path: str, epochs: int = 10):
    preprocess_task = preprocess_data(data_path=data_path)

    train_task = train_model(
        data=preprocess_task.outputs["output_data"],
        epochs=epochs,
    )

    # As in the original example, the model is evaluated on the same
    # preprocessed data it was trained on.
    evaluate_task = evaluate_model(
        model=train_task.outputs["model"],
        test_data=preprocess_task.outputs["output_data"],
    )


# Compile the pipeline
compiler.Compiler().compile(ml_pipeline, "pipeline.yaml")
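The pipeline never states an execution order explicitly: a step runs once the steps whose outputs it consumes have finished. The sketch below illustrates that idea with the standard library's `graphlib`. It is only an analogy for how data dependencies imply ordering, not how Kubeflow Pipelines schedules tasks internally, and the task labels are shorthand for the three components above.

```python
from graphlib import TopologicalSorter

# Task label -> set of tasks whose outputs it consumes.
dependencies = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"train", "preprocess"},
}


def execution_order(deps):
    """Return one valid execution order for the dependency graph."""
    return list(TopologicalSorter(deps).static_order())


print(execution_order(dependencies))
# → ['preprocess', 'train', 'evaluate']
```

Because evaluation depends on both earlier steps and training depends on preprocessing, this graph admits only one valid order. Steps without a dependency between them would be free to run in parallel.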
2. Run the Pipeline
from kfp.client import Client

# Connect to Kubeflow
client = Client(host="http://localhost:8080/pipeline")

# Create an experiment
experiment = client.create_experiment("my-experiment")

# Run the pipeline
# (the KFP v2 SDK exposes the ID as experiment.experiment_id; v1 used experiment.id)
run = client.run_pipeline(
    experiment_id=experiment.experiment_id,
    job_name="training-run-1",
    pipeline_package_path="pipeline.yaml",
    params={"data_path": "gs://bucket/data.csv", "epochs": 20},
)
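Submitting a run returns immediately, so a script usually needs to wait for the result. The sketch below wraps the SDK client's `wait_for_run_completion` method. It assumes the KFP v2 SDK, where the submitted run exposes `run_id` and the finished run exposes a `state` field; `wait_for_run` is a hypothetical helper name.

```python
def wait_for_run(client, run_id, timeout_seconds=3600):
    """Block until the run finishes and return its final state.

    `client` is expected to be a kfp.client.Client (or anything exposing
    the same wait_for_run_completion method). Returns None if the returned
    object has no `state` attribute, as with older SDK versions.
    """
    finished = client.wait_for_run_completion(
        run_id=run_id,
        timeout=timeout_seconds,
    )
    return getattr(finished, "state", None)
```

With the objects from the example above, this would be called as `wait_for_run(client, run.run_id)`. The exact state strings depend on the SDK and backend version, so check them against your installation before branching on them.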