Serverless Model Serving on Kubernetes with KServe
KServe is a Kubernetes-native platform for serving machine learning models, built around a single custom resource called the InferenceService. If you already run workloads on Kubernetes and want model serving that scales to zero, supports canary rollouts, and exposes a standard inference protocol, KServe fits into your existing cluster instead of asking you to adopt a separate stack. This tutorial walks through the architecture, installation, and the day-to-day patterns you will use in production.
What KServe Is
KServe (formerly KFServing) is a CNCF model inference platform that runs on Kubernetes. Its core idea is to hide the boilerplate of building a serving deployment behind a declarative CRD. Instead of writing a Deployment, Service, HorizontalPodAutoscaler, and an ingress route by hand, you describe one InferenceService and KServe reconciles everything else.
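To give a sense of how little you declare, here is a minimal sketch that builds an InferenceService manifest as a Python dictionary and prints it as JSON, which kubectl accepts as well as YAML. The service name `sklearn-iris` and the `storageUri` bucket path are placeholders invented for this example; the field layout assumes the `serving.kserve.io/v1beta1` API.

```python
import json


def inference_service(name: str, model_format: str, storage_uri: str) -> dict:
    """Build a minimal InferenceService manifest (v1beta1 API assumed)."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                "model": {
                    # KServe matches this format against a ServingRuntime.
                    "modelFormat": {"name": model_format},
                    "storageUri": storage_uri,
                }
            }
        },
    }


# Placeholder name and bucket path; substitute your own model artifact.
manifest = inference_service(
    name="sklearn-iris",
    model_format="sklearn",
    storage_uri="gs://example-bucket/models/sklearn/iris",
)

if __name__ == "__main__":
    # The printed JSON can be piped to: kubectl apply -f -
    print(json.dumps(manifest, indent=2))
```

Everything else (the Deployment or Knative Service, the routing, the autoscaler) is derived from these few fields by the controller.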
A few characteristics distinguish KServe from a plain serving container:
- It is protocol-aware. KServe standardizes prediction endpoints around the V1 data plane and the Open Inference Protocol (V2), so clients written against the spec work across different model frameworks.
- It is runtime-aware. Built-in ServingRuntime definitions cover common frameworks (scikit-learn, XGBoost, PyTorch, TensorFlow, Triton, MLServer, Hugging Face) so you only point at a model artifact.
- It is composable. A request can flow through a transformer (pre/post-processing) and then a predictor, optionally with an explainer attached.
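The protocol point is easiest to see in the request shapes. The sketch below builds the URL path and JSON body for a V1 predict call and for an Open Inference Protocol (V2) infer call. The model name `sklearn-iris`, the tensor name `input-0`, and the sample rows are assumptions for illustration; the tensor name and datatype your model expects depend on the runtime.

```python
import json


def v1_request(model: str, instances: list) -> tuple[str, str]:
    """Return (path, body) for a V1 predict call."""
    return f"/v1/models/{model}:predict", json.dumps({"instances": instances})


def v2_request(model: str, tensor_name: str, rows: list[list[float]]) -> tuple[str, str]:
    """Return (path, body) for an Open Inference Protocol (V2) infer call."""
    # V2 describes each input as a named tensor with an explicit shape;
    # the data is sent here as a flat, row-major list.
    flat = [value for row in rows for value in row]
    body = {
        "inputs": [
            {
                "name": tensor_name,
                "shape": [len(rows), len(rows[0])],
                "datatype": "FP32",
                "data": flat,
            }
        ]
    }
    return f"/v2/models/{model}/infer", json.dumps(body)


# Hypothetical feature rows for a four-feature model.
rows = [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]
v1_path, v1_body = v1_request("sklearn-iris", rows)
v2_path, v2_body = v2_request("sklearn-iris", "input-0", rows)

if __name__ == "__main__":
    print(v1_path, v1_body)
    print(v2_path, v2_body)
```

Because the V2 body carries shape and datatype explicitly, the same client code can address models served by different runtimes.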
Unlike a standalone model server that you package and deploy yourself, KServe is a control plane: it leans on Kubernetes primitives and, in serverless mode, on Knative and Istio.
Relationship to Knative and Istio
KServe supports two deployment modes, and the difference comes down to which dependencies you install.
Serverless mode
In serverless mode, KServe uses Knative Serving to manage the lifecycle of predictor pods and Istio (or another Knative-supported networking layer) for ingress and traffic routing. Knative gives you two things that are hard to build yourself: request-driven autoscaling down to zero replicas, and revision-based traffic splitting that powers canary rollouts. This is the default and most feature-complete mode.
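Both features surface as fields on the predictor. The following sketch, under the assumption of the `v1beta1` API, sets `minReplicas` to 0 to permit scale-to-zero and uses a small helper to add `canaryTrafficPercent`. The helper name `with_canary`, the service name, and the storage path are hypothetical.

```python
import copy
import json

BASE = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris"},  # placeholder name
    "spec": {
        "predictor": {
            "minReplicas": 0,  # allows the predictor to scale to zero when idle
            "model": {
                "modelFormat": {"name": "sklearn"},
                # Placeholder path for the new model version being rolled out.
                "storageUri": "gs://example-bucket/models/sklearn/iris-v2",
            },
        }
    },
}


def with_canary(manifest: dict, percent: int) -> dict:
    """Return a copy that routes `percent` of traffic to the newest revision."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    updated = copy.deepcopy(manifest)
    updated["spec"]["predictor"]["canaryTrafficPercent"] = percent
    return updated


canary = with_canary(BASE, 10)

if __name__ == "__main__":
    print(json.dumps(canary, indent=2))
```

Applying the canary manifest over an existing service would leave the remaining 90 percent of traffic on the previously rolled-out revision; raising the percentage to 100 promotes the new one.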
Standard (raw) mode
Standard mode (sometimes called "raw Kubernetes" deployment) skips Knative and Istio. KServe creates plain Kubernetes Deployment, Service, and HorizontalPodAutoscaler objects directly. You lose scale-to-zero and revision-based canary, but you also drop two large dependencies. Standard mode is a good fit when your platform team does not want Istio in the cluster, or when steady traffic means scale-to-zero offers little benefit.
You select the mode per service with an annotation:
metadata:
  annotations:
    serving.kserve.io/deploymentMode: "RawDeployment" # or "Serverless"
Prerequisites
You need:
- A Kubernetes cluster, version 1.28 or newer. A local kind or minikube cluster works for learning; a managed cluster (GKE, EKS, AKS) works for production.
- kubectl configured to talk to that cluster.
- Cluster-admin permissions to install CRDs and webhooks.
- For serverless mode, enough headroom for Knative and Istio control-plane pods (roughly 2 vCPU and 4 GB of free capacity beyond your workloads).
Confirm your context before installing:
kubectl config current-context
kubectl get nodes
kubectl version
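If you want to script the version check, one option is to parse the output of `kubectl version -o json`. The sketch below takes that JSON as a string and compares the server's minor version against the 1.28 minimum. The function name and the trimmed sample payload are illustrative assumptions; real output contains more fields.

```python
import json


def server_minor_at_least(version_json: str, required_minor: int = 28) -> bool:
    """Check the serverVersion block of `kubectl version -o json` output."""
    server = json.loads(version_json)["serverVersion"]
    # Managed clusters may report minors such as "29+", so keep digits only.
    minor_digits = "".join(ch for ch in server["minor"] if ch.isdigit())
    return int(server["major"]) > 1 or int(minor_digits) >= required_minor


# Trimmed, hand-written sample of the JSON shape.
sample = '{"serverVersion": {"major": "1", "minor": "29+", "gitVersion": "v1.29.4"}}'

if __name__ == "__main__":
    print(server_minor_at_least(sample))
```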
Installation
Quickstart script (development)
For a development cluster, the quickstart script installs KServe together with Knative, Istio, and cert-manager in one step. Always read a script before piping it to a shell.
# Inspect first
curl -sL "https://raw.githubusercontent.com/kserve/kserve/release-0.14/hack/quick_install.sh" -o quickinstall.sh
less quickinstall.sh
# Then run
bash quickinstall.sh
When it finishes, verify the control plane is healthy:
kubectl get pods -n kserve
kubectl get pods -n knative-serving
kubectl get pods -n istio-system
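Reading pod tables by eye gets tedious in CI. As a rough sketch, the function below takes the output of `kubectl get pods -n <namespace> -o json` as a string and returns the names of pods that are not Ready, skipping pods that have already completed. The function name and the sample pod names are made up for the example.

```python
import json


def unready_pods(pod_list_json: str) -> list[str]:
    """Return names of pods that are neither Ready nor completed."""
    unready = []
    for pod in json.loads(pod_list_json)["items"]:
        status = pod.get("status", {})
        if status.get("phase") == "Succeeded":
            continue  # completed Job pods are not expected to be Ready
        conditions = status.get("conditions", [])
        ready = any(
            c["type"] == "Ready" and c["status"] == "True" for c in conditions
        )
        if not ready:
            unready.append(pod["metadata"]["name"])
    return unready


# Hand-written sample with hypothetical pod names; real output has more fields.
sample = json.dumps({
    "items": [
        {
            "metadata": {"name": "controller-ok"},
            "status": {
                "phase": "Running",
                "conditions": [{"type": "Ready", "status": "True"}],
            },
        },
        {
            "metadata": {"name": "controller-pending"},
            "status": {"phase": "Pending", "conditions": []},
        },
        {
            "metadata": {"name": "setup-job-done"},
            "status": {"phase": "Succeeded"},
        },
    ]
})

if __name__ == "__main__":
    print(unready_pods(sample))
```

An empty list for each of the three namespaces is a reasonable signal that the control plane has settled.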