Serverless Model Serving on Kubernetes with KServe

KServe is a Kubernetes-native platform for serving machine learning models, built around a single custom resource called the InferenceService. If you already run workloads on Kubernetes and want model serving that scales to zero, supports canary rollouts, and exposes a standard inference protocol, KServe fits into your existing cluster instead of asking you to adopt a separate stack. This tutorial walks through the architecture, installation, and the day-to-day patterns you will use in production.

What KServe Is

KServe (formerly KFServing) is a CNCF model inference platform that runs on Kubernetes. Its core idea is to hide the boilerplate of building a serving deployment behind a declarative CRD. Instead of writing a Deployment, Service, HorizontalPodAutoscaler, and an ingress route by hand, you describe one InferenceService and KServe reconciles everything else.
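To make "one InferenceService" concrete, here is a minimal sketch that writes a manifest for a scikit-learn model to a file. The resource name `sklearn-iris`, the file name, and the `storageUri` (a public sample bucket used in the KServe examples) are assumptions for illustration; substitute your own model location.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write a minimal InferenceService manifest.
# The name and storageUri are illustrative placeholders.
cat > sklearn-iris.yaml <<'EOF'
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
EOF

echo "wrote sklearn-iris.yaml"
```

Once KServe is installed, `kubectl apply -f sklearn-iris.yaml` submits the resource and `kubectl get inferenceservice sklearn-iris` reports its readiness and URL.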

A few characteristics distinguish KServe from a plain serving container:

It is protocol-aware. KServe standardizes prediction endpoints around the V1 data plane and the Open Inference Protocol (V2), so clients written against the spec work across different model frameworks.
It is runtime-aware. Built-in ServingRuntime definitions cover common frameworks (scikit-learn, XGBoost, PyTorch, TensorFlow, Triton, MLServer, Hugging Face) so you only point at a model artifact.
It is composable. A request can flow through a transformer (pre/post-processing) and then a predictor, optionally with an explainer attached.
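The difference between the two data planes is easiest to see in the request bodies. The sketch below writes one payload per protocol for the same two-row input. The model name `sklearn-iris`, the tensor name `input-0`, and the `INGRESS_URL` environment variable are assumptions for illustration, and the requests are only sent when `INGRESS_URL` is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

MODEL_NAME="sklearn-iris"   # hypothetical model name

# V1 data plane: a bare list of instances.
cat > v1-request.json <<'EOF'
{"instances": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]}
EOF

# Open Inference Protocol (V2): named, typed tensors with an explicit shape.
cat > v2-request.json <<'EOF'
{
  "inputs": [
    {
      "name": "input-0",
      "shape": [2, 4],
      "datatype": "FP32",
      "data": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]
    }
  ]
}
EOF

V1_PATH="/v1/models/${MODEL_NAME}:predict"
V2_PATH="/v2/models/${MODEL_NAME}/infer"
echo "V1 endpoint path: ${V1_PATH}"
echo "V2 endpoint path: ${V2_PATH}"

# Send the requests only when an ingress URL is provided (assumed variable).
if [ -n "${INGRESS_URL:-}" ]; then
  curl -s -H "Content-Type: application/json" -d @v1-request.json "${INGRESS_URL}${V1_PATH}"
  curl -s -H "Content-Type: application/json" -d @v2-request.json "${INGRESS_URL}${V2_PATH}"
fi
```

Behind an Istio ingress gateway you would typically also pass a `Host` header that matches the hostname in the service's URL.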

KServe is not a model server in its own right. It is a control plane: it delegates inference to the serving runtimes and leans on Kubernetes primitives and, in serverless mode, on Knative and Istio.

Relationship to Knative and Istio

KServe supports two deployment modes, and the difference comes down to which dependencies you install.

Serverless mode

In serverless mode, KServe uses Knative Serving to manage the lifecycle of predictor pods and Istio (or another Knative-supported networking layer) for ingress and traffic routing. Knative gives you two things that are hard to build yourself: request-driven autoscaling down to zero replicas, and revision-based traffic splitting that powers canary rollouts. This is the default and most feature-complete mode.
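As a sketch of how those two Knative features surface in the InferenceService spec, the manifest below sets `minReplicas: 0` to allow scale-to-zero and `canaryTrafficPercent: 10` to route a tenth of the traffic to the newest revision. The resource name, file name, and `storageUri` are illustrative assumptions, and the example presumes an earlier revision of the same service is already running.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Updated manifest: new model version rolled out as a 10% canary.
cat > sklearn-iris-canary.yaml <<'EOF'
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    minReplicas: 0            # let the predictor scale to zero when idle
    canaryTrafficPercent: 10  # share of traffic sent to the newest revision
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model-2
EOF

echo "wrote sklearn-iris-canary.yaml"
```

Applying the file creates a new revision that receives 10% of requests while the previous revision keeps the rest. Raising the percentage in steps, and finally removing the field, promotes the new revision.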

Standard (raw) mode

Standard mode (sometimes called "raw Kubernetes" deployment) skips Knative and Istio. KServe creates plain Kubernetes Deployment, Service, and HorizontalPodAutoscaler objects directly. You lose scale-to-zero and revision-based canary, but you also drop two large dependencies. Standard mode is a good fit when your platform team does not want Istio in the cluster, or when steady traffic means scale-to-zero offers little benefit.

You select the mode per service with an annotation:

    metadata:
      annotations:
        serving.kserve.io/deploymentMode: "RawDeployment"  # or "Serverless"
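In context, that annotation sits on the InferenceService itself. The following sketch writes a complete raw-mode manifest; the resource name, file name, `storageUri`, and replica bounds are assumptions for illustration. In this mode the replica bounds are expected to end up on the HorizontalPodAutoscaler that KServe creates.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Raw-mode InferenceService: plain Deployment, Service, and HPA, no Knative.
cat > sklearn-iris-raw.yaml <<'EOF'
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris-raw
  annotations:
    serving.kserve.io/deploymentMode: "RawDeployment"
spec:
  predictor:
    minReplicas: 1   # no scale-to-zero in raw mode
    maxReplicas: 3   # illustrative upper bound for the autoscaler
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
EOF

echo "wrote sklearn-iris-raw.yaml"
```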

Prerequisites

You need:

A Kubernetes cluster, version 1.28 or newer. A local kind or minikube cluster works for learning; a managed cluster (GKE, EKS, AKS) works for production.
kubectl configured to talk to that cluster.
Cluster-admin permissions to install CRDs and webhooks.
For serverless mode, enough headroom for Knative and Istio control-plane pods (roughly 2 vCPU and 4 GB of free capacity beyond your workloads).

Confirm your context before installing:

    kubectl config current-context
    kubectl get nodes
    kubectl version
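If you want to automate the version requirement rather than read it off the output, a small helper can compare each node's kubelet version against the 1.28 minimum. This is a sketch: the function name `meets_minimum` is hypothetical, and the node check only runs when `kubectl` is on the PATH.

```shell
#!/usr/bin/env bash
set -euo pipefail

MIN_MINOR=28  # minimum Kubernetes minor version from the prerequisites

# Succeed when a version string such as "v1.29.3" is 1.MIN_MINOR or newer.
meets_minimum() {
  local version="${1#v}"          # strip a leading "v"
  local major="${version%%.*}"    # text before the first dot
  local rest="${version#*.}"      # text after the first dot
  local minor="${rest%%[!0-9]*}"  # leading digits of the minor component
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge "$MIN_MINOR" ]; }
}

# Check every node's kubelet version when kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' |
    while read -r node version; do
      if meets_minimum "$version"; then
        echo "ok       $node $version"
      else
        echo "too old  $node $version"
      fi
    done
fi
```

The parsing tolerates vendor suffixes such as `v1.28.3-gke.100` because only the leading digits of the minor component are compared.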

Installation

Quickstart script (development)

For a development cluster, the quickstart script installs KServe together with Knative, Istio, and cert-manager in one step. Always read a script before you run it; download it and inspect it rather than piping it straight into a shell.

    # Inspect first
    curl -sL "https://raw.githubusercontent.com/kserve/kserve/release-0.14/hack/quick_install.sh" -o quick_install.sh
    less quick_install.sh

    # Then run
    bash quick_install.sh

When it finishes, verify the control plane is healthy:

    kubectl get pods -n kserve
    kubectl get pods -n knative-serving
    kubectl get pods -n istio-system
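Listing pods shows a snapshot; to block until the control plane is actually ready, you can wait on the Deployments in each namespace instead. This is a sketch that assumes the three namespaces from the verification step and a five-minute timeout chosen for illustration. It skips the check when `kubectl` is not installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Namespaces created by the quickstart install.
NAMESPACES=(kserve knative-serving istio-system)

if command -v kubectl >/dev/null 2>&1; then
  for ns in "${NAMESPACES[@]}"; do
    echo "waiting for deployments in ${ns}..."
    # Block until every Deployment reports Available, or fail after 5 minutes.
    kubectl wait --for=condition=Available deployment --all \
      --namespace "${ns}" --timeout=300s
  done
else
  echo "kubectl not found; skipping readiness check" >&2
fi
```

A non-zero exit from `kubectl wait` stops the script, which makes it usable as a gate in a setup pipeline.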
